Hi Marco,
As a committer, I have been working on the Apache Lucene More Like This
component and its integration in Apache Solr for a long time.
Let me try to summarise a bit how it works to help you with your use case.
You may find it useful to look at a presentation I gave at the Open
Source Summit in Tokyo in 2017 (
https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works)
and a London Lucene/Solr meetup talk from 2019 (
https://www.youtube.com/watch?v=jkaj89XwHHw, partial recording).

First of all, the More Like This implementation sits at the Lucene level:

   - it takes as input a document id (fetched from the index) or a
   stream of text
   - it takes various parameters regarding the fields to use and the
   minimum frequencies to consider
   - it uses TF/IDF to assign a score to each term of the original
   document *(considering the whole local index as the corpus)*
   - it returns *a query with the list of terms*, potentially boosted by
   the term importance (controlled by the boost parameter); a usage
   sketch follows this list
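
At the Lucene level, a minimal usage sketch looks roughly like this
(the index path, field name, doc id and frequency thresholds are
placeholders, not recommendations; the setters are on
org.apache.lucene.queries.mlt.MoreLikeThis):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MltSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      MoreLikeThis mlt = new MoreLikeThis(reader); // the local index is the corpus
      mlt.setAnalyzer(new StandardAnalyzer());     // needed if the field has no term vectors
      mlt.setFieldNames(new String[] { "Content_en" }); // fields to mine for interesting terms
      mlt.setMinTermFreq(2);  // skip terms occurring fewer than 2 times in the seed doc
      mlt.setMinDocFreq(5);   // skip terms occurring in fewer than 5 corpus docs
      mlt.setBoost(true);     // boost each term by its relative importance
      Query likeThis = mlt.like(42); // 42 = internal Lucene doc id of the seed document
      TopDocs similar = new IndexSearcher(reader).search(likeThis, 10);
      System.out.println(likeThis + " -> " + similar.totalHits);
    }
  }
}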

In the past I did a substantial amount of work to restructure the
module, support BM25 for scoring the terms, and make the component more
readable/maintainable, but it didn't get enough traction and the
contribution stalled (I am open to resuscitating it if there's interest).

In Apache Solr the More Like This is integrated in three ways (
https://issues.apache.org/jira/browse/SOLR-13172 adds some details):
1) the query component (this is what happens when you add mlt=true as a
request parameter and include the mlt component in a request handler);
it calculates and runs an MLT query for each document in the search
results
2) the request handler
3) the query parser, which just builds the MLT query
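
For illustration, the three roughly look like the requests below
(doc_id stands for the value of your uniqueKey field, Content_en is
taken from your schema, and the mintf/mindf values are placeholders;
note the /mlt handler is not enabled by default and needs to be
configured in solrconfig.xml):

1) /select?q=UniqueReference:doc_id&mlt=true&mlt.fl=Content_en
2) /mlt?q=UniqueReference:doc_id&mlt.fl=Content_en&mlt.mintf=2&mlt.mindf=5
3) /select?q={!mlt qf=Content_en mintf=2 mindf=5}doc_id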

Coming back to your question:

*1.  Which is the best Solr request to perform this task?*
If you want to find documents similar to a document identified by its
ID, the best option is the *{!mlt} query parser*; it's compatible with
SolrCloud.
The request is processed by a Solr node, which fetches the document
(potentially from another shard) and then builds the MLT query *locally*.
Bear in mind that the entire local shard is used for the document
frequency calculations; if your shards are skewed, you should use
global IDF.
Using this approach you can build your final query as you like, so you
can add additional boolean clauses and filters on top of the MLT query.
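
A hedged example for your use case (field names taken from your schema;
mintf/mindf are placeholders to tune):

q={!mlt qf=Content_en mintf=2 mindf=5}doc_id
fq=DataType:B
fq=DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]

Note that the fq clauses restrict which documents can be *returned*;
they do not restrict the corpus used for the term-importance
statistics, which is the subject of your second question.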

*  2.  Is there a parameter that allows me to restrict the corpus of
documents that are analyzed for the return of similar contents? It
should be noted that this corpus of documents may not contain the
initial document from which I am starting*
No. At the moment the entire corpus of the local core that is processing
the request is used to calculate the importance of the terms in the seed
document.
As I said before, it doesn't matter much whether the document is present
locally or not, as the More Like This query parser is SolrCloud
compatible.
But if you want to limit the corpus for the term-importance calculations,
some Lucene customizations are needed.
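
To give an idea of what such a customization involves: the document
frequency of each candidate term would have to be computed against the
restricted corpus instead of the whole local index. A naive (and
per-term expensive) illustrative sketch, reusing your DataType field as
the restriction:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// df of a candidate term, counted only within DataType:B documents
static int restrictedDf(IndexSearcher searcher, String term) throws Exception {
  BooleanQuery.Builder b = new BooleanQuery.Builder();
  b.add(new TermQuery(new Term("Content_en", term)), BooleanClause.Occur.MUST);
  b.add(new TermQuery(new Term("DataType", "B")), BooleanClause.Occur.FILTER);
  return searcher.count(b.build()); // number of matching docs, no scoring
}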

Hope this helps,

Cheers

--------------------------
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 14 Mar 2022 at 22:05, Tim Casey <tca...@gmail.com> wrote:

> Hi,
>
> > Regarding the specific problem on the existence of a specific parameter
> to restrict the corpus of documents that are analyzed for the return of
> similar contents
>
> If you can get this to be a query, and one which might be ordered in a
> useful way, then you are very likely to see what you need in the top 500
> results.  This would be enough for most usage.
> The 'likely' would need to be computed and measured as you produce results.
>
>
> In any event, to restrict the corpus you build a query bit set and use that
> as a filter.  This is fairly easy to code, so you can see the results and
> give yourself a way to experiment on what you would do before deciding
> how and what to do in any one particular way.
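>
> A rough Lucene sketch of this filter idea (the names mltQuery and
> corpusRestriction are illustrative: the similarity query and the query
> defining the allowed corpus, respectively):
>
>   BooleanQuery.Builder b = new BooleanQuery.Builder();
>   b.add(mltQuery, BooleanClause.Occur.MUST);            // scored similarity clause
>   b.add(corpusRestriction, BooleanClause.Occur.FILTER); // restricts matches, no scoring
>   TopDocs top500 = searcher.search(b.build(), 500);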
>
> Or, you directly query and allow Solr to do the needed computations within
> each shard.  At this point, I would defer to people who are more versed in
> Solr specifics for this kind of computation.
>
> On Mon, Mar 14, 2022 at 12:56 AM Marco D'Ambra <m.dam...@volocom.it>
> wrote:
>
> > Hi Tim,
> >
> > thank you very much for the answer, full of useful advice.
> > I will try to put into practice what you told me to improve the output of
> > the calls.
> > Regarding the specific problem on the existence of a specific parameter
> to
> > restrict the corpus of documents that are analyzed for the return of
> > similar contents, I must admit that I have not yet figured out how to
> > proceed.
> >
> > Thank you very much and have a nice day,
> >
> > Marco
> >
> > -----Original Message-----
> > From: Tim Casey <tca...@gmail.com>
> > Sent: Thursday, 10 March 2022 19:51
> > To: users@solr.apache.org
> > Subject: Re: Question regarding the MoreLikeThis features
> >
> > Marco,
> >
> > Finding 'similar' documents will end up being weighted by document
> > length.  I would recommend, at the point of indexing, also indexing an
> > ordered token set of the first 256, 1024, up to around 5k tokens
> > (depending on document lengths).  What this does is allow a
> > vector-to-vector normalized comparison.  You could then query for
> > similar possible documents directly and build a normalized vector with
> > respect to the query document.
> >
> > Normalizing schemes in something like an inverted index will tend to
> > weight the lower token count documents over higher token count
> > documents.  So the above is an attempt to get at a normalized and
> > comparable view between documents independent of size.  Next you end up
> > normalizing by the inverse of a commonality.  That is, a more common
> > token is weighted lower than a less common token.  (I would also
> > discount tokens which have a raw frequency below 5.)  At the point you
> > have a normalized vector, you can use that to find similarities
> > weighted by more meaningful tokens.
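> >
> > A compact sketch of that weighting (purely illustrative; docTermFreqs,
> > corpusDocFreq and N stand for the seed document's term counts, the
> > corpus document frequency and the corpus size):
> >
> >   Map<String, Double> vec = new HashMap<>();
> >   double norm = 0;
> >   for (Map.Entry<String, Integer> e : docTermFreqs.entrySet()) {
> >     int df = corpusDocFreq(e.getKey());
> >     if (df < 5) continue;                                 // discount very rare tokens
> >     double w = e.getValue() * Math.log((double) N / df);  // rarer => higher weight
> >     vec.put(e.getKey(), w);
> >     norm += w * w;
> >   }
> >   final double len = Math.sqrt(norm);
> >   if (len > 0) vec.replaceAll((t, w) -> w / len);         // L2-normalize to unit length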
> >
> > tim
> >
> > On Thu, Mar 10, 2022 at 9:18 AM Marco D'Ambra <m.dam...@volocom.it>
> wrote:
> >
> > > Hi all,
> > > This is my first time writing to this mailing list and I would like to
> > > thank you in advance for your attention.
> > > I am writing because I am having problems using the "MoreLikeThis"
> > > features.
> > > I am working in a Solr cluster (version 8.11.1) consisting of multiple
> > > nodes, each of which contains multiple shards.
> > >
> > > It is quite a big cluster; data is sharded using implicit routing
> > > and documents are distributed by date into monthly shards.
> > >
> > > Here are the fields that I'm using:
> > >
> > >   *   UniqueReference: the unique reference of a document
> > >   *   DocumentDate: the date of a document (in the standard Solr format)
> > >   *   DataType: the data type of the document (let's say it can be A
> > > or B)
> > >   *   Content: the content of a document (a string)
> > > Here is what my managed schema looks like ...
> > > <field name="UniqueReference" type="string" indexed="true"
> stored="true"
> > > required="true" />
> > >
> > > <field name="DocumentDate" type="pdate" indexed="true" stored="false"
> > > required="true" />
> > >
> > > <field name="DataType" type="string" indexed="true" stored="false"
> > > required="true" />
> > >
> > > <field name="Content_en" type="text_en" indexed="true" stored="true"
> > > required="false" />
> > > ...
> > >
> > >
> > > The task that I want to perform is the following:
> > > Given the unique reference of a document of type A, I want to find
> > > the documents of data type B, within a fixed time interval, that have
> > > the most similar content.
> > > Here are the first questions:
> > >
> > >   1.  Which is the best Solr request to perform this task?
> > >   2.  Is there a parameter that allows me to restrict the corpus of
> > > documents that are analyzed for the return of similar contents? It
> > > should be noted that this corpus of documents may not contain the
> > > initial document from which I am starting.
> > > Initially I thought about using the "mlt" endpoint, but since there
> > > was no parameter in the documentation that would allow me to select
> > > the shard on which to direct the query (I absolutely need it,
> > > otherwise I risk putting a strain on my cluster), I opted to use the
> > > "select" endpoint, with the "mlt" parameter set to true, and the
> > > "shards" parameter.
> > > Those are the parameters that I am using:
> > >
> > >   *   q: "UniqueReference:doc_id"
> > >   *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]
> > > AND DataType:B) OR (UniqueReference:doc_id)"
> > >   *   mlt: true
> > >   *   mlt.fl: "Content"
> > >   *   shards: "shard_202201"
> > > I realize that the "fq" parameter is used in a bizarre way. In theory
> > > it should be aimed at the documents of the main query (in my case the
> > > source document). It is an attempt to solve problem (2) (which didn't
> > > work, actually).
> > > Anyway, my doubts are not limited to this. What really surprises me is
> > > the structure of the response that Solr returns to me.
> > > The content of response looks like this:
> > > {
> > >   "response" : {
> > >     "docs" : [],
> > >     ...
> > >   },
> > >   "moreLikeThis" : ...
> > > }
> > > The weird stuff appears in the "moreLikeThis" part. Sometimes Solr
> > > returns me a list, other times a dictionary. Repeating the same call
> > > several times, the two possibilities alternate, apparently without a
> > > logical pattern, and I have not been able to understand why.
> > > And to be precise, in both cases the documents contained in the
> > > response are not necessarily of data type B, as requested with the
> > > "fq" parameter.
> > > In the "dictionary" case, there is only one key, which is the
> > > UniqueReference of the source document, and the corresponding value
> > > is the list of similar documents.
> > > In the "list" case, the second element contains the required
> > > documents.
> > > So, here is the last question:
> > >
> > >   1.  I am perfectly aware that I am lost, therefore: what am I missing?
> > > I thank everyone for the attention you have dedicated to me. Greetings
> > > from Italy.
> > > I'm available for clarifications,
> > >
> > > Marco
> > >
> > >
> >
>
