Re: Slowness when searching in child documents.

Mikhail Khludnev Wed, 04 Jan 2023 08:10:51 -0800

It looks like this:

"debug":{ ...
  "parsedquery":"name:some",
  "parsedquery_toString":"name:some",
  ...
  "QParser":"LuceneQParser",
  "filter_queries":["features:here"],
  "parsed_filter_queries":["features:here"],
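
To compare how different queries are parsed, the relevant entries can be pulled out of the debug section with a few lines of Python. This is only a minimal sketch: it assumes the JSON response has already been fetched and decoded into a dict. The base URL, the helper names, and the QTime value in the sample are made up for illustration. The `debugQuery=true` parameter and the key names are the ones Solr uses in its standard debug response.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, query, filters=()):
    """Build a /select URL that asks Solr to include debug output.

    base_url is a hypothetical collection URL; rows=0 skips document
    retrieval so the timing mostly reflects query evaluation.
    """
    params = [("q", query), ("rows", "0"), ("debugQuery", "true"), ("wt", "json")]
    params += [("fq", fq) for fq in filters]
    return f"{base_url}/select?{urlencode(params)}"


def summarize_parsing(response):
    """Extract the entries that show how the query and filters were parsed."""
    debug = response.get("debug", {})
    header = response.get("responseHeader", {})
    return {
        "qtime_ms": header.get("QTime"),
        "parser": debug.get("QParser"),
        "parsed_query": debug.get("parsedquery"),
        "parsed_filters": debug.get("parsed_filter_queries", []),
    }


# Sample response mirroring the fragment above; the QTime value is invented.
sample = {
    "responseHeader": {"status": 0, "QTime": 3},
    "debug": {
        "parsedquery": "name:some",
        "parsedquery_toString": "name:some",
        "QParser": "LuceneQParser",
        "filter_queries": ["features:here"],
        "parsed_filter_queries": ["features:here"],
    },
}

print(build_debug_url("http://localhost:8983/solr/works", "name:some", ["features:here"]))
print(summarize_parsing(sample))
```

Running the child subquery alone through the same two helpers gives its QTime and parsed form side by side, which is what I was asking for.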


On Wed, Jan 4, 2023 at 3:41 PM Noah Torp-Smith <[email protected]> wrote:

> Thanks again for your input, Mikhail. I need to look more into this
> debugOutput and the way we use `!parent which`.
>
> Can you maybe elaborate on which part of the debug output I should look at
> in order to say "how is it parsed"? Is that output documented somewhere
> (other than the solr source code)?
>
> Best regards,
>
> /Noah
>
>
> --
>
> Noah Torp-Smith ([email protected])
>
> ________________________________
> From: Mikhail Khludnev <[email protected]>
> Sent: 3 January 2023 19:29
> To: [email protected] <[email protected]>
> Subject: Re: Slowness when searching in child documents.
>
> Hold on.
>
> >  I remove the first part of the filter (the one with parent which),
> Noah, what's the performance of the child subquery alone?
> q=pid.material_type:(\"lydbog\" \"artikel\")
> What's qtime and how is it parsed?
>
> On Tue, Jan 3, 2023 at 5:55 PM Noah Torp-Smith <[email protected]>
> wrote:
>
> > Thanks for the response. Here is a more hands-on example with
> > measurements that may illustrate the issue better:
> >
> > We are on Solr 9.0.1.
> >
> > We send this to Solr (it is an equals sign, not a colon, after `parent
> > which`; sorry for the confusion on my part):
> >
> > {
> >     "query": "flotte huse",
> >     "filter": [
> >         "{!parent which='doc_type:work'}(pid.material_type:(\"lydbog\" \"artikel\"))",
> >         "doc_type:work"
> >     ],
> >     "fields": "work.workid",
> >     "offset": 0,
> >     "limit": 10,
> >     "params": {
> >         "defType": "edismax",
> >         "qf": [
> >             "work.creator^100",
> >             "work.creator_fuzzy^0.001",
> >             "work.series^75",
> >             "work.subject_bibdk",
> >             "work.subject_fuzzy^0.001",
> >             "work.title^100",
> >             "work.title_fuzzy^0.001"
> >         ],
> >         "pf": [
> >             "work.creator^200",
> >             "work.fictive_character",
> >             "work.series^175",
> >             "work.title^1000"
> >         ],
> >         "pf2": [
> >             "work.creator^200",
> >             "work.fictive_character",
> >             "work.series^175",
> >             "work.title^1000"
> >         ],
> >         "pf3": [
> >             "work.creator^200",
> >             "work.fictive_character",
> >             "work.series^175",
> >             "work.title^1000"
> >         ],
> >         "mm": "2<80%",
> >         "mm.autoRelax": "true",
> >         "ps": 5,
> >         "ps2": 5,
> >         "ps3": 5
> >     }
> > }
> >
> >
> > This fetches 21 workids and takes more than 20 seconds. If I remove the
> > first part of the filter (the one with parent which), it fetches 33
> > workids in less than 200 milliseconds. It does not matter whether I run
> > it with or without the filtering to material types first (as long as I
> > come up with new examples so the filter cache is not being used).
> >
> > So it does not seem to depend on the number of returned documents.
> >
> > Thanks again for your help, it is much appreciated.
> >
> >
> > --
> >
> > Noah Torp-Smith ([email protected])
> >
> > ________________________________
> > From: Mikhail Khludnev <[email protected]>
> > Sent: 3 January 2023 14:09
> > To: [email protected] <[email protected]>
> > Subject: Re: Slowness when searching in child documents.
> >
> > Hello, Noah.
> >
> > A few notes: Query time depends on the number of results. When one query
> > is slower than another, a larger number of enumerated docs is often the
> > explanation.
> > Examine how the query is parsed in the debugQuery output. There are many
> > tricks and pitfalls in query parsers. E.g. I'm not sure why you put a
> > colon after which, whether you send it to Solr that way, and how Solr
> > interprets it.
> > Which version of Solr/Lucene are you running? Some time ago Lucene had no
> > two-phase iteration and was prone to redundant enumerations.
> >
> > > if there is some way to evaluate the search at the work level first, and
> > > then do the filtering for those works that have manifestations matching
> > > the child requirements afterwards?
> > That's how it's expected to work. You can confirm your hypothesis by
> > intersecting {!parent ..}.. with work_id:123, whether via fq or +. It
> > should turn around in a moment.
> >
> > So, if everything is right, your indices might just be too large, and you
> > may have to break them into many shards.
> >
> >
> > On Tue, Jan 3, 2023 at 1:12 PM Noah Torp-Smith <[email protected]>
> > wrote:
> >
> > > We are facing a performance issue when searching in child documents. In
> > > order to explain the issue, I will provide a very simplified excerpt of
> > > our data model.
> > >
> > > We are making a search engine for libraries. What we want to deliver to
> > > the users are "works". An example of a work could be Harry Potter and the
> > > Goblet of Fire. Each work can have several manifestations; for example
> > > there is a book version of the work, an audiobook, and maybe an e-book.
> > > Of course, there are properties at the work level (like creator, title,
> > > subjects, etc.) and other properties at the manifestation level (like
> > > publication year, material type, etc.).
> > >
> > > We have modelled this with parent documents and child documents in Solr,
> > > and have built a search engine on it. The search engine can search for
> > > things like creators, titles, and subjects at the work level, but users
> > > should also be allowed to search for things from a specific year or be
> > > able to specify that they are only interested in things that are
> > > available as e-books.
> > >
> > > We have around 28 million works in Solr and 41 million manifestations,
> > > indexed as child documents (so many works have only one manifestation).
> > >
> > > As long as the user searches for things at the work level, the
> > > performance is fine. But as you can imagine, when users search for things
> > > at the manifestation level, the performance worsens. As an example, if we
> > > make a search for a creator, the search executes in less than 200 ms and
> > > results in maybe 30 hits. If we add a clause for a material type (with a
> > > `{!parent which:'doc_type:work'}materialType:"book"` construction), the
> > > search takes several seconds. In this case we want the filtering to books
> > > to be part of the ranking, so putting it in a filter query will pose a
> > > problem.
> > >
> > > I am wondering if there is some way to evaluate the search at the work
> > > level first, and then do the filtering for those works that have
> > > manifestations matching the child requirements afterwards? I could try to
> > > do the search for work-level properties first and only fetch IDs, and then
> > > do the full search with the manifestation-level requirements afterwards,
> > > with an added filter query on the IDs, but I am wondering if there is a
> > > better way to do this.
> > >
> > > I have also looked at denormalizing
> > > (https://blog.innoventsolutions.com/innovent-solutions-blog/2018/05/avoid-the-parentchild-trap-tips-and-tricks-for-denormalizing-solr-data.html)
> > > and it helps when doing it for a few child fields. But as said, there are
> > > more properties in the real model than those I have mentioned here, so
> > > that also involves some complications.
> > >
> > > Kind regards,
> > >
> > > /Noah
> > >
> > >
> > > --
> > >
> > > Noah Torp-Smith ([email protected])
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
