Re: Block MAX WAND feature use

Tomás Fernández Löbbe Thu, 15 Feb 2024 11:14:03 -0800

Rank Fields [1] are one way to influence ranking while still allowing the
optimization. Multi-field queries should be OK. I don't remember off the top
of my head whether DisMax queries work; I believe they do, but I don't know
why you wouldn't be seeing an improvement.



[1] https://issues.apache.org/jira/browse/SOLR-14590
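
To narrow down which query shape loses the optimization, the same search can
be issued in a few variants and the QTime and numFoundExact values compared.
Below is a minimal sketch that only builds the request URLs. The endpoint,
collection name and helper name are hypothetical; the parameters
(minExactCount, defType, df, qf) are the ones already used in this thread.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and collection name, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def build_select_url(q, min_exact_count=None, **extra):
    """Build a /select URL, adding minExactCount only when it is requested."""
    params = {"q": q, "rows": 10, "fl": "id,score"}
    params.update(extra)
    if min_exact_count is not None:
        params["minExactCount"] = min_exact_count
    return SOLR_SELECT + "?" + urlencode(params)


# Single-field query with the lucene parser (the shape reported to speed up).
simple = build_select_url(
    "white flowers card", 100, defType="lucene", df="description"
)

# Multi-field edismax query (the shape reported to show no gain).
multi = build_select_url(
    "white flowers card", 100, defType="edismax", qf="keywords description title"
)

# Same edismax query without the parameter, as the baseline.
baseline = build_select_url(
    "white flowers card", defType="edismax", qf="keywords description title"
)

for url in (simple, multi, baseline):
    print(url)
```

Running each URL several times against a warmed-up index and comparing
numFoundExact and QTime in the responses should show which variants actually
skip documents.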

On Thu, Feb 15, 2024 at 11:05 AM Mikhail Khludnev <m...@apache.org> wrote:

> I don't know exactly. It might be a sum, a product, or any other combining
> function.
> Another thought:
> The minExactCount optimization always returns the top-scoring docs; it just
> skips weaker matches.
> But if you introduce rescoring after extracting the BM25 top hits, it loses
> precision:
> Think about a top-rated doc with a lower matching score: it might not be
> picked because of minExactCount, and consequently is never rescored.
> To summarize: rescoring after minExactCount is not functionally correct.
>
> On Thu, Feb 15, 2024 at 8:16 PM rajani m <rajinima...@gmail.com> wrote:
>
> > If the boosts are multiple function queries such as the following [1], then
> > the boost query would be a sum function wrapping them, is that right? I
> > missed that one.
> >  [1] "sum(product(popularity,2),1.0)" and
> > "recip(ms(NOW/HOUR,date),3.163e-11,1,1)"
> >
> >
> > I will post the question on the dev channel regarding what is expected
> > when there are multiple query fields.
> >
> > Thanks again for all your help and pointers.
> >
> >
> > On Thu, Feb 15, 2024 at 2:25 AM Mikhail Khludnev <m...@apache.org> wrote:
> >
> > > Hello,
> > > Please check inline below.
> > >
> > > > On Thu, Feb 15, 2024 at 2:11 AM rajani m <rajinima...@gmail.com> wrote:
> > >
> > > > > Yes, rerank works as an alternative, but rerank only supports one
> > > > > boost query, correct? If there are multiple boost conditions, such as
> > > > > boosts by date, season and popularity, putting all of them into one
> > > > > complex boost query is a hard problem; reranking with LTR can help.
> > > > > Thank you for that pointer.
> > > >
> > > > Perhaps I missed something, but why can't a boost query combine
> > > > multiple clauses?
> > >
> > >
> > > >
> > > > > The other limitation is that it is not possible to query multiple
> > > > > fields and leverage this feature. That is still an issue, because it
> > > > > is also a common use case to have title, description and keyword
> > > > > fields separated rather than merged into one.
> > > >
> > > > It's worth checking with dev@.
> > > > I didn't code anything there. My vague understanding is that it skips
> > > > blocks of docIDs whose max tf (term frequency) is lower than what has
> > > > been seen so far.
> > > > So such skip conditions have to be pushed through the query scoring
> > > > logic, which might not be obvious (or might it?) in the case of a max
> > > > over fields.
> > >
> > >
> > > >
> > > >
> > > > Regards,
> > > > Rajani
> > > >
> > > > > On Wed, Feb 14, 2024 at 1:54 PM Mikhail Khludnev <m...@apache.org> wrote:
> > > >
> > > > > Cool.
> > > > > Btw can you rerank results with the corresponding boost query?
> > > > >
> > > > > > On Wed, Feb 14, 2024 at 8:46 PM rajani m <rajinima...@gmail.com> wrote:
> > > > >
> > > > > > > Mikhail,
> > > > > > >
> > > > > > >   Thanks for that pointer to test with a simple query. It works
> > > > > > > perfectly with the lucene query parser; I see qtime drop by a
> > > > > > > factor of 7 with this param.
> > > > > > >
> > > > > > > With an edismax query, it works with the caveats that "qf" (query
> > > > > > > fields) must have only one field and the query must not have
> > > > > > > boost/bf parameters. We would expect it to work with boost params
> > > > > > > because the boost is applied after the documents are matched and
> > > > > > > scored by block max WAND as a first pass. Am I right? Without
> > > > > > > support for "boost" params, the feature is not really usable. The
> > > > > > > recency and popularity boosts are common to most queries. What
> > > > > > > are your thoughts?
> > > > > >
> > > > > > Thank you,
> > > > > > Rajani
> > > > > >
> > > > > >
> > > > > > > On Tue, Feb 6, 2024 at 2:54 PM rajani m <rajinima...@gmail.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > > > With a 400M index it's worth experimenting with skipping about
> > > > > > > > > a million docs.
> > > > > > > >  Is there a param that allows setting how many docs to skip?
> > > > > > > >
> > > > > > > >  There is "minExactCount", which decides how many docs it should
> > > > > > > > care to score. I tested it with 100, 1000 and 2000, and latency
> > > > > > > > only increased.
> > > > > > >
> > > > > > > > Alessandro,
> > > > > > > > Assuming the segment count is approximately the total number of
> > > > > > > > files under /solr/replica_name/data/index, it is 222. The sizes
> > > > > > > > of the largest files:
> > > > > > > >
> > > > > > > > -rw-r--r-- 1 solr solr  766M Feb  4 04:16 _chg1.cfs
> > > > > > > -rw-r--r-- 1 solr solr 1020M Jan 29 18:37 _ca21.cfs
> > > > > > > -rw-r--r-- 1 solr solr  3.7G Nov  5 23:49 _95vt.cfs
> > > > > > > -rw-r--r-- 1 solr solr  3.8G Jan 15 08:59 _boyy.cfs
> > > > > > > -rw-r--r-- 1 solr solr  3.8G Nov 29 16:01 _9ynt.cfs
> > > > > > > -rw-r--r-- 1 solr solr  3.8G Jan 25 00:47 _c3t7.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.1G Oct 26 14:37 _8pyh.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.1G Oct 26 14:38 _7cwt.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.3G Oct 27 06:04 _7s6c.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.3G Oct 26 14:37 _7n8z.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.5G Jan 18 00:30 _dteg.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.5G Jan 19 17:44 _cwcc.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.6G Jan 13 07:35 _blix.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.9G Oct 26 14:39 _8gu9.cfs
> > > > > > > -rw-r--r-- 1 solr solr  4.9G Oct 26 14:38 _3kj9.cfs
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Feb 6, 2024 at 2:45 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
> > > > > > >
> > > > > > > >> It would be interesting to see the level of fragmentation of
> > > > > > > >> each index indeed...
> > > > > > > >> I.e., how many segments per core, in a collection.
> > > > > > >>
> > > > > > > >> On Tue, 6 Feb 2024, 06:59 Mikhail Khludnev, <m...@apache.org> wrote:
> > > > > > >>
> > > > > > > >> > 200-300 docs might be too few to get a significant gain. With a
> > > > > > > >> > 400M index it's worth experimenting with skipping about a
> > > > > > > >> > million docs.
> > > > > > > >> > By simplified params I mean defType=lucene&df=description.
> > > > > > > >> > debugQuery might expose some details as well.
> > > > > > > >> > As far as I understand, this feature works with large segments,
> > > > > > > >> > since it skips a block of a segment, not a whole segment (?).
> > > > > > >> >
> > > > > > > >> > On Mon, Feb 5, 2024 at 8:04 PM rajani m <rajinima...@gmail.com> wrote:
> > > > > > >> >
> > > > > > > >> > > The "numFound" value differs by 200-300 docs compared to the
> > > > > > > >> > > query without the "minExactCount" param. The collection has
> > > > > > > >> > > over 400M records, so I am testing the feature on a large
> > > > > > > >> > > collection. The numFoundExact value in the response is
> > > > > > > >> > > consistently false, which tells me the feature is
> > > > > > > >> > > functioning, but the results (qtime) are just off, not as
> > > > > > > >> > > expected.
> > > > > > > >> > >
> > > > > > > >> > > Would the type of query parser matter? I tested without the
> > > > > > > >> > > secondary sort; even without it there is no improvement in
> > > > > > > >> > > query latency, and it is still higher than for the query
> > > > > > > >> > > without this param.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > > >> > > On Mon, Feb 5, 2024 at 10:34 AM Mikhail Khludnev <m...@apache.org> wrote:
> > > > > > >> > >
> > > > > > >> > > > Hello,
> > > > > > >> > > > How many matches do you have in both cases?
> > > > > > > >> > > > I see there's a second sorting expression; it might not
> > > > > > > >> > > > comply with the requirements.
> > > > > > > >> > > > I'd rather start from the simple single query parser, just
> > > > > > > >> > > > for the experiments.
> > > > > > >> > > > Note: I never tried it myself.
> > > > > > >> > > >
> > > > > > > >> > > > On Mon, Feb 5, 2024 at 6:20 PM rajani m <rajinima...@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > > >> > > > > I ran performance tests with different query sets and the
> > > > > > > >> > > > > results do not look good: the parameter adds about 15% to
> > > > > > > >> > > > > the latency instead of reducing or even matching it. Not
> > > > > > > >> > > > > sure if I am missing something in the config or if it is
> > > > > > > >> > > > > an issue.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Here is an example query *without* the WAND query parameter
> > > > > > > >> > > > > select?&fl=id,ext_id&start=0&q.op=OR&sort=score desc,ext_id asc&rows=10&q=white flowers card&defType=edismax&qf=keywords description title
> > > > > > > >> > > > > vs
> > > > > > > >> > > > > *With* the WAND query parameter
> > > > > > > >> > > > > select?&fl=id,ext_id&start=0&q.op=OR&sort=score desc,ext_id asc&rows=10&q=white flowers card&defType=edismax&qf=keywords description title*&minExactCount=10*
> > > > > > >> > > > >
> > > > > > > >> > > > > On Thu, Feb 1, 2024 at 8:36 AM rajani m <rajinima...@gmail.com> wrote:
> > > > > > >> > > > >
> > > > > > > >> > > > > > Hi Ishan,
> > > > > > > >> > > > > >    I have looked into that doc, and it looks like the Solr
> > > > > > > >> > > > > > version has to be >8.8 and the config needed is to add
> > > > > > > >> > > > > > the query parameter "&minExactCount=k", where k is 10 or
> > > > > > > >> > > > > > 100 depending on the accuracy of the first k docs.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I ran a query performance test using an internal tool,
> > > > > > > >> > > > > > with k set to 10 and 100, which barely showed any
> > > > > > > >> > > > > > difference in query latency. I didn't expect that, so I
> > > > > > > >> > > > > > was wondering if there is any configuration I missed.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Meanwhile I will run a couple more tests with different
> > > > > > > >> > > > > > query sets and dig further into the implementation of
> > > > > > > >> > > > > > the feature to see if I am missing any config here.
> > > > > > > >> > > > > > Appreciate any suggestions.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks,
> > > > > > >> > > > > > Rajani
> > > > > > >> > > > > >
> > > > > > > >> > > > > > On Thu, Feb 1, 2024 at 12:53 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > > > > >> > > > > >
> > > > > > > >> > > > > >> Is it possible to benchmark the query performance across
> > > > > > > >> > > > > >> a larger set of queries? You can leverage Solr Bench, if
> > > > > > > >> > > > > >> needed.
> > > > > > > >> > > > > >> https://github.com/fullstorydev/solr-bench
> > > > > > >> > > > > >>
> > > > > > > >> > > > > >> On Thu, 1 Feb, 2024, 11:20 am Ishan Chattopadhyaya, <ichattopadhy...@gmail.com> wrote:
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> > Some documentation is here
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://solr.apache.org/guide/8_6/common-query-parameters.html#minexactcount-parameter
> > > > > > >> > > > > >> >
> > > > > > > >> > > > > >> > On Thu, 1 Feb, 2024, 9:53 am rajani m, <rajinima...@gmail.com> wrote:
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> >> Hi All,
> > > > > > >> > > > > >> >>
> > > > > > > >> > > > > >> >>   To leverage the query time improvements that come with
> > > > > > > >> > > > > >> >> the Block MAX WAND feature, what are the required
> > > > > > > >> > > > > >> >> configurations?
> > > > > > > >> > > > > >> >>
> > > > > > > >> > > > > >> >> I am on Solr 9.1.1. As per the docs, including the
> > > > > > > >> > > > > >> >> "minExactCount=100" query param should do it; however,
> > > > > > > >> > > > > >> >> I don't see any drop in query time, it is more or less
> > > > > > > >> > > > > >> >> the same. Am I missing something?
> > > > > > > >> > > > > >> >>
> > > > > > > >> > > > > >> >> The queries I tested with are standard ones, with
> > > > > > > >> > > > > >> >> edismax as the query parser; the query text is
> > > > > > > >> > > > > >> >> converted into boolean clauses and the query has 2
> > > > > > > >> > > > > >> >> boost params, by date and popularity field. I set
> > > > > > > >> > > > > >> >> "minExactCount" as low as 10 and 100 and increased it
> > > > > > > >> > > > > >> >> to 1000, but didn't see any significant change in
> > > > > > > >> > > > > >> >> query time; it was about the same.
> > > > > > > >> > > > > >> >>
> > > > > > > >> > > > > >> >>  Would queries that include a boost or use the edismax
> > > > > > > >> > > > > >> >> parser not benefit from block MAX WAND? Example query:
> > > > > > > >> > > > > >> >> /select?q=((white) AND (roses OR jasmine))&defType=edismax&qf=keywords description title&pf2=title&bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
> > > > > > >> > > > > >> >>
> > > > > > >> > > > > >> >>
> > > > > > >> > > > > >> >> Thank you,
> > > > > > >> > > > > >> >> Rajani
> > > > > > >> > > > > >> >>
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > --
> > > > > > >> > > > Sincerely yours
> > > > > > >> > > > Mikhail Khludnev
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Sincerely yours
> > > > > > >> > Mikhail Khludnev
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > >
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
