Question regarding the MoreLikeThis features

Marco D'Ambra Thu, 10 Mar 2022 09:18:20 -0800

Hi all,
This is my first time writing to this mailing list and I would like to thank 
you in advance for your attention.
I am writing because I am having problems using the "MoreLikeThis" features.
I am working in a Solr cluster (version 8.11.1) consisting of multiple nodes, 
each of which contains multiple shards.


It is quite a big cluster. Data is sharded using implicit routing, and 
documents are distributed by date across monthly shards.

Here are the fields that I'm using:

  *   UniqueReference: the unique reference of a document
  *   DocumentDate: the date of a document (in the standard Solr format)
  *   DataType: the data type of the document (let's say it can be A or B)
  *   Content: the content of a document (a string)
Here is what my managed schema looks like:
...
<field name="UniqueReference" type="string" indexed="true" stored="true" required="true" />

<field name="DocumentDate" type="pdate" indexed="true" stored="false" required="true" />

<field name="DataType" type="string" indexed="true" stored="false" required="true" />

<field name="Content_en" type="text_en" indexed="true" stored="true" required="false" />
...


The task that I want to perform is the following:
Given the unique reference of a document of type A, I want to find the 
documents of data type B and in a fixed time interval, that have the most 
similar content.
Here are my first questions:

  1.  What is the best Solr request to perform this task?
  2.  Is there a parameter that lets me restrict the corpus of documents 
that are analyzed when returning similar content? Note that this corpus 
may not contain the source document I am starting from.
Initially I thought about using the "mlt" endpoint, but since there was no 
parameter in the documentation that would allow me to select the shard on which 
to direct the query (I absolutely need it, otherwise I risk putting a strain on 
my cluster), I opted to use the "select" endpoint, with the "mlt" parameter set 
to true, and the "shards" parameter.
Those are the parameters that I am using:

  *   q: "UniqueReference:doc_id"
  *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z] AND 
DataType:B) OR (UniqueReference:doc_id)"
  *   mlt: true
  *   mlt.fl: "Content"
  *   shards: "shard_202201"
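
As a minimal sketch, the request described by the parameters above could be 
assembled in Python as follows. The base URL and the collection name 
(`localhost:8983`, `my_collection`) are placeholders and not part of the 
original setup, and `doc_id` stands for the real unique reference. The code 
only builds the URL; it does not contact Solr.

```python
from urllib.parse import urlencode

# Placeholder values: base URL and collection name are assumptions.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "my_collection"


def build_mlt_select_url(doc_id, start, end, data_type, shard):
    """Build a /select URL with the parameters listed in the post."""
    params = {
        "q": f"UniqueReference:{doc_id}",
        "fq": (
            f"(DocumentDate:[{start} TO {end}] AND DataType:{data_type}) "
            f"OR (UniqueReference:{doc_id})"
        ),
        "mlt": "true",
        "mlt.fl": "Content",
        "shards": shard,
    }
    # urlencode percent-encodes colons, brackets, and spaces in the values.
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


url = build_mlt_select_url(
    "doc_id",
    "2022-01-22T00:00:00Z",
    "2022-01-26T00:00:00Z",
    "B",
    "shard_202201",
)
print(url)
```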
I realize that the "fq" parameter is used in a bizarre way. In theory it should 
be aimed at the documents of the main query (in my case the source document). 
It is an attempt to solve problem (2) (which didn't work, actually).
Anyway, my doubts are not limited to this. What really surprises me is the 
structure of the response that Solr returns to me.
The content of the response looks like this:
{
  "response" : {
    "docs" : [],
    ...
  },
  "moreLikeThis" : ...
}
The weird part is the "moreLikeThis" section. Sometimes Solr returns a list, 
other times a dictionary. When I repeat the same call several times, both 
shapes occur, apparently without a logical pattern, and I have not been able 
to understand why.
And to be precise, in both cases the documents contained in the answer are not 
necessarily of data type B, as requested by me with the "fq" parameter.
In the "dictionary" case, there is only one key, which is the UniqueReference 
of the source document, and the corresponding value holds the similar documents.
In the "list" case, the second element contains the required documents.
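
As a client-side workaround sketch, both shapes could be folded into one 
dictionary before further processing. This assumes the list form alternates 
keys and values (the post only confirms that the second element holds the 
documents), and the sample data below is made up for illustration.

```python
def normalize_more_like_this(mlt):
    """Return {source_id: similar_docs} for either observed shape."""
    if isinstance(mlt, dict):
        return dict(mlt)
    if isinstance(mlt, list):
        # Assumption: flat list of alternating keys and values.
        if len(mlt) % 2 != 0:
            raise ValueError("expected alternating keys and values")
        return {mlt[i]: mlt[i + 1] for i in range(0, len(mlt), 2)}
    raise TypeError(f"unexpected moreLikeThis type: {type(mlt).__name__}")


# Hypothetical sample data, not taken from a real Solr response.
similar = [{"UniqueReference": "other_doc"}]
as_dict = {"doc_id": similar}
as_list = ["doc_id", similar]

print(normalize_more_like_this(as_dict) == normalize_more_like_this(as_list))
```

This does not explain why the shape changes between calls; it only makes the 
client code independent of it.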
So, here is my last question:

  1.  I am perfectly aware that I am lost, so what am I missing?
I thank everyone for the attention you have dedicated to me. Greetings from 
Italy.
I'm available for clarifications,

Marco
