Hi,

> Regarding the specific problem on the existence of a specific parameter
to restrict the corpus of documents that are analyzed for the return of
similar contents

If you can get this to be a query, and one which might be ordered in a
useful way, then you are very likely to see what you need in the top 500
results.  This would be enough for most usage.
The 'likely' would need to be computed and measured as you produce results.


In any event, to restrict the corpus you build a query bit set and use that
as a filter.  This is fairly easy to code, so you can see the results and
give yourself a way to experiment, before deciding on any one particular
approach.
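
As a rough sketch (untested; the host, collection, and handler names are
placeholders, and it assumes a MoreLikeThis handler registered at /mlt in
solrconfig.xml), the filter idea could look like the following.  With the
dedicated /mlt handler, fq is documented to apply to the similar-document
results, and the handler is typically not distributed, so the request stays
on the core you send it to:

import requests

MLT_URL = "http://localhost:8983/solr/mycollection/mlt"

params = {
    "q": "UniqueReference:doc_id",   # the source document
    "fq": [                          # the "bit set" filter restricting the corpus
        "DataType:B",
        "DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]",
    ],
    "mlt.fl": "Content_en",          # field the similarity is computed on
    "rows": 500,                     # the top-N window mentioned above
    "wt": "json",
}

result = requests.get(MLT_URL, params=params).json()
for doc in result["response"]["docs"]:
    print(doc["UniqueReference"])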

Or, you query directly and allow Solr to do the needed computations within
each shard.  At that point, I would defer to people who are more versed in
Solr specifics for this kind of computation.

On Mon, Mar 14, 2022 at 12:56 AM Marco D'Ambra <m.dam...@volocom.it> wrote:

> Hi Tim,
>
> thank you very much for the answer, full of useful advice.
> I will try to put into practice what you told me to improve the output of
> the calls.
> Regarding the specific problem on the existence of a specific parameter to
> restrict the corpus of documents that are analyzed for the return of
> similar contents, I must admit that I have not yet figured out how to
> proceed.
>
> Thank you very much and have a nice day,
>
> Marco
>
> -----Original Message-----
> From: Tim Casey <tca...@gmail.com>
> Sent: giovedì 10 marzo 2022 19:51
> To: users@solr.apache.org
> Subject: Re: Question regarding the MoreLikeThis features
>
> Marco,
>
> Finding 'similar' documents will end up being weighted by document length.
> I would recommend, at the point of indexing, also indexing an ordered
> token set of the first 256, 1024, or up to around 5k tokens (depending on
> document lengths).  What this does is allow a vector-to-vector normalized
> comparison.  You could then query for possible similar documents directly
> and build a normalized vector with respect to the query document.
>
> Normalizing schemes in something like an inverted index will tend to
> weight lower-token-count documents over higher-token-count documents.
> So the above is an attempt to get a normalized, comparable view between
> documents independent of size.  Next you end up normalizing by the
> inverse of a token's commonality.  That is, a more common token is
> weighted lower than a less common one.  (I would also discount tokens
> which have a raw frequency below 5.)  At the point you have a normalized
> vector, you can use that to find similarities weighted by more meaningful
> tokens.
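>
> A rough sketch of that idea in plain Python (the thresholds and the
> corpus_freq map are illustrative; in practice the frequencies would come
> from your index statistics):
>
> import math
> from collections import Counter
>
> def normalized_vector(tokens, corpus_freq, max_tokens=1024, min_freq=5):
>     """Build a length-independent vector over the first max_tokens tokens,
>     weighting each token by the inverse of its corpus frequency."""
>     counts = Counter(tokens[:max_tokens])
>     vec = {}
>     for tok, tf in counts.items():
>         cf = corpus_freq.get(tok, 0)
>         if cf < min_freq:      # discount tokens with raw frequency below 5
>             continue
>         vec[tok] = tf / cf     # more common tokens are weighted lower
>     norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
>     return {tok: w / norm for tok, w in vec.items()}
>
> def similarity(a, b):
>     """Cosine similarity between two normalized sparse vectors."""
>     return sum(w * b.get(tok, 0.0) for tok, w in a.items())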
>
> tim
>
> On Thu, Mar 10, 2022 at 9:18 AM Marco D'Ambra <m.dam...@volocom.it> wrote:
>
> > Hi all,
> > This is my first time writing to this mailing list and I would like to
> > thank you in advance for your attention.
> > I am writing because I am having problems using the "MoreLikeThis"
> > features.
> > I am working in a Solr cluster (version 8.11.1) consisting of multiple
> > nodes, each of which contains multiple shards.
> >
> > It is a quite big cluster and data is sharded using implicit routing
> > and documents are distributed by date on monthly shards.
> >
> > Here are the fields that I'm using:
> >
> >   *   UniqueReference: the unique reference of a document
> >   *   DocumentDate: the date of a document (in the standard Solr format)
> >   *   DataType: the data type of the document (let's say that can be A or
> > B)
> >   *   Content: the content of a document (a string)
> > Here is what my managed schema looks like ...
> > <field name="UniqueReference" type="string" indexed="true" stored="true"
> > required="true" />
> >
> > <field name="DocumentDate" type="pdate" indexed="true" stored="false"
> > required="true" />
> >
> > <field name="DataType" type="string" indexed="true" stored="false"
> > required="true" />
> >
> > <field name="Content_en" type="text_en" indexed="true" stored="true"
> > required="false" />
> > ...
> >
> >
> > The task that I want to perform is the following:
> > Given the unique reference of a document of type A, I want to find the
> > documents of data type B and in a fixed time interval, that have the
> > most similar content.
> > Here the first questions:
> >
> >   1.  What is the best Solr request to perform this task?
> >   2.  Is there a parameter that allows me to restrict the corpus of
> > documents that are analyzed for the return of similar contents? It
> > should be noted that this corpus of documents may not contain the
> > initial document from which I am starting.
> >
> > Initially I thought about using the "mlt" endpoint, but since there was
> > no parameter in the documentation that would allow me to select the
> > shard on which to direct the query (I absolutely need it, otherwise I
> > risk putting a strain on my cluster), I opted to use the "select"
> > endpoint, with the "mlt" parameter set to true and the "shards"
> > parameter.
> >
> > These are the parameters that I am using:
> >
> >   *   q: "UniqueReference:doc_id"
> >   *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]
> > AND DataType:B) OR (UniqueReference:doc_id)"
> >   *   mlt: true
> >   *   mlt.fl: "Content"
> >   *   shards: "shard_202201"
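> >
> > Assembled as an actual request (the host and collection here are
> > placeholders), the call I am making looks like this:
> >
> > import requests
> >
> > params = {
> >     "q": "UniqueReference:doc_id",
> >     "fq": '(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z] '
> >           'AND DataType:B) OR (UniqueReference:doc_id)',
> >     "mlt": "true",
> >     "mlt.fl": "Content",
> >     "shards": "shard_202201",
> >     "wt": "json",
> > }
> > response = requests.get(
> >     "http://localhost:8983/solr/mycollection/select", params=params
> > ).json()
> >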
> > I realize that the "fq" parameter is used in a bizarre way. In theory
> > it should apply to the documents of the main query (in my case, the
> > source document). Using it this way is an attempt to solve problem (2)
> > (which didn't work, actually).
> > Anyway, my doubts are not limited to this. What really surprises me is
> > the structure of the response that Solr returns to me.
> > The content of the response looks like this:
> >
> > {
> >   "response" : {
> >     "docs" : [],
> >     ...
> >   },
> >   "moreLikeThis" : ...
> > }
> > The weird stuff appears in the "moreLikeThis" part. Sometimes Solr
> > returns me a list, other times a dictionary. Repeating the same call
> > several times, the two possibilities alternate, apparently without a
> > logical pattern, and I have not been able to understand why. And to be
> > precise, in both cases the documents contained in the answer are not
> > necessarily of data type B, as requested with the "fq" parameter.
> > In the "dictionary" case, there is only one key, which is the
> > UniqueReference of the source document, and the corresponding value is
> > the list of similar documents.
> > In the "list" case, the second element contains the required documents.
> > So, here is the last question:
> >
> >   3.  I am perfectly aware that I am lost; therefore, what am I missing?
> > I thank everyone for the attention you have dedicated to me. Greetings
> > from Italy.
> > I'm available for clarifications,
> >
> > Marco
> >
> >
>
