Thanks again for your input, Mikhail. I need to look more into the debug
output and the way we use `!parent which`.

Can you maybe elaborate on which part of the debug output I should look at in
order to answer "how is it parsed"? Is that output documented somewhere (other
than the Solr source code)?

Best regards,

/Noah


--

Noah Torp-Smith (n...@dbc.dk)

________________________________
From: Mikhail Khludnev <m...@apache.org>
Sent: 3 January 2023 19:29
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Slowness when searching in child documents.

Hold on.

>  I remove the first part of the filter (the one with parent which),
Noah, what's the performance of the child subquery alone?
q=pid.material_type:(\"lydbog\" \"artikel\")
What's qtime and how is it parsed?
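For example, something like this (just a sketch of the kind of request I mean,
reusing the JSON request API from your example below):

{
    "query": "pid.material_type:(\"lydbog\" \"artikel\")",
    "limit": 10,
    "params": { "debugQuery": "true" }
}

QTime is in the responseHeader, and the "debug" section of the response has
"parsedquery" / "parsedquery_toString", which show how the query parser
actually interpreted the input.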

On Tue, Jan 3, 2023 at 5:55 PM Noah Torp-Smith <n...@dbc.dk.invalid> wrote:

> Thanks for the response. Here is a more hands-on example with measurements
> that hopefully illustrates the issue better:
>
> We are on Solr 9.0.1.
>
> We send this to Solr (it's an equals sign, not a colon, after parent which;
> sorry for the confusion on my part):
>
> {
>     "query": "flotte huse",
>     "filter": [
>         "{!parent which='doc_type:work'}(pid.material_type:(\"lydbog\"
> \"artikel\"))",
>         "doc_type:work"
>         ],
>     "fields": "work.workid",
>     "offset": 0,
>     "limit": 10,
>     "params": {
>         "defType": "edismax",
>         "qf": [
>             "work.creator^100",
>             "work.creator_fuzzy^0.001",
>             "work.series^75",
>             "work.subject_bibdk",
>             "work.subject_fuzzy^0.001",
>             "work.title^100",
>             "work.title_fuzzy^0.001"
>         ],
>         "pf": [
>             "work.creator^200",
>             "work.fictive_character",
>             "work.series^175",
>             "work.title^1000"
>         ],
>         "pf2": [
>             "work.creator^200",
>             "work.fictive_character",
>             "work.series^175",
>             "work.title^1000"
>         ],
>         "pf3": [
>             "work.creator^200",
>             "work.fictive_character",
>             "work.series^175",
>             "work.title^1000"
>         ],
>         "mm": "2<80%",
>         "mm.autoRelax": "true",
>         "ps": 5,
>         "ps2": 5,
>         "ps3": 5
>     }
> }
>
>
> This fetches 21 workids and takes more than 20 seconds. If I remove the
> first part of the filter (the one with parent which), it fetches 33 workids
> in less than 200 milliseconds. It does not matter whether I do it with or
> without the filtering to material types first (as long as I come up with new
> examples so the filter cache is not being used).
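>
> (To rule the filter cache out explicitly, I believe the filters can also be
> marked as non-cached with the cache local param, roughly like this, just a
> sketch:
>
>     "filter": [
>         "{!parent which='doc_type:work' cache=false}(pid.material_type:(\"lydbog\" \"artikel\"))",
>         "{!cache=false}doc_type:work"
>     ]
>
> so the timings are not skewed by earlier runs.)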
>
> So it does not seem to depend on the number of returned documents.
>
> Thanks again for your help, it is much appreciated.
>
>
> --
>
> Noah Torp-Smith (n...@dbc.dk)
>
> ________________________________
> From: Mikhail Khludnev <m...@apache.org>
> Sent: 3 January 2023 14:09
> To: users@solr.apache.org <users@solr.apache.org>
> Subject: Re: Slowness when searching in child documents.
>
> Hello, Noah.
>
> A few notes: Query time depends on the number of results. When one query is
> slower than another, a larger number of enumerated docs is often the
> explanation.
> Examine how the query is parsed in the debugQuery output. There are many
> tricks and pitfalls in query parsers, e.g. I'm not sure why you put a colon
> after which, whether you sent it to Solr like that, and how Solr interprets
> it.
> Which version of Solr/Lucene are you running? Some time ago Lucene had no
> two-phase iteration and was prone to redundant enumerations.
>
> > if there is some way to evaluate the search at the work level first, and
> > then do the filtering for those works that have manifestations matching the
> > child requirements afterwards?
> That's how it's expected to work. You can confirm your hypothesis by
> intersecting {!parent ..}.. with work_id:123, either via fq or a + clause. It
> should turn around in a moment.
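> For example (a sketch; "work.workid:123" just stands in for an actual id from
> your index):
>
>     "filter": [
>         "{!parent which='doc_type:work'}(pid.material_type:(\"lydbog\" \"artikel\"))",
>         "work.workid:123"
>     ]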
>
> So, if everything is right, your indices might simply be too large, and you
> may have to break them into more shards.
>
>
> On Tue, Jan 3, 2023 at 1:12 PM Noah Torp-Smith <n...@dbc.dk.invalid>
> wrote:
>
> > We are facing a performance issue when searching in child documents. In
> > order to explain the issue, I will provide a very simplified excerpt of our
> > data model.
> >
> > We are making a search engine for libraries. What we want to deliver to
> > the users are "works". An example of a work could be Harry Potter and the
> > Goblet of Fire. Each work can have several manifestations; for example,
> > there is a book version of the work, an audiobook, and maybe an e-book. Of
> > course, there are properties at the work level (like creator, title,
> > subjects, etc.) and other properties at the manifestation level (like
> > publication year, material type, etc.).
> >
> > We have modelled this with parent documents and child documents in Solr,
> > and have built a search engine on it. The search engine can search for
> > things like creators, titles, and subjects at the work level, but users
> > should also be allowed to search for things from a specific year or be able
> > to specify that they are only interested in things that are available as
> > e-books.
> >
> > We have around 28 million works in Solr and 41 million manifestations,
> > indexed as child documents (so many works have only one manifestation).
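> >
> > Roughly, each work and its manifestations are indexed as one block, along
> > these lines (a simplified sketch with anonymous child documents; the real
> > documents have many more fields):
> >
> >     {
> >         "id": "work-1",
> >         "doc_type": "work",
> >         "work.title": "Harry Potter and the Goblet of Fire",
> >         "_childDocuments_": [
> >             { "id": "work-1-m1", "pid.material_type": "lydbog" },
> >             { "id": "work-1-m2", "pid.material_type": "artikel" }
> >         ]
> >     }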
> >
> > As long as the user searches for things at the work level, the
> > performance is fine. But as you can imagine, when users search for things
> > at the manifestation level, the performance worsens. As an example, if we
> > make a search for a creator, the search executes in less than 200 ms and
> > results in maybe 30 hits. If we add a clause for a material type (with a
> > `{!parent which:'doc_type:work'}materialType:"book"` construction), the
> > search takes several seconds. In this case we want the filtering to books
> > to be part of the ranking, so putting it in a filter query will pose a
> > problem.
> >
> > I am wondering if there is some way to evaluate the search at the work
> > level first, and then do the filtering for those works that have
> > manifestations matching the child requirements afterwards. I could try to
> > do the search for work-level properties first and only fetch IDs, and then
> > do the full search with the manifestation-level requirements afterwards and
> > an added filter query with the IDs, but I am wondering if there is a better
> > way to do this.
> >
> > I have also looked at denormalizing (
> > https://blog.innoventsolutions.com/innovent-solutions-blog/2018/05/avoid-the-parentchild-trap-tips-and-tricks-for-denormalizing-solr-data.html
> > ), and it helps when doing it for a few child fields. But as noted, there
> > are more properties in the real model than those I have mentioned here, so
> > that also involves some complications.
> >
> > Kind regards,
> >
> > /Noah
> >
> >
> > --
> >
> > Noah Torp-Smith (n...@dbc.dk)
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
