Re: join query parser performance

Ron Haines Thu, 15 Jun 2023 05:08:32 -0700

yes, we would return 'D'.

So, are you asking why not just do the join in the main index?  I started
that way, then realized that a document and the doc it 'belongs' to both
need to be on the same shard for the join to work.  That's when I moved to
the 'fromIndex' approach and created the small 'fromIndex' collection (under
200k docs), single-sharded, replicated across all of the shards of the main
collection.
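
For reference, here is a minimal sketch in plain Python (not Solr code) of
what the negative join filter is meant to do with the example tables quoted
below. It assumes the user query matches the same set of group_ids in the
'fromIndex' collection as in the main collection; the function name and the
dict layout are hypothetical, for illustration only.

```python
from typing import Dict, List, Set


def apply_negative_join(matched: Set[str], member_of: Dict[str, List[str]]) -> List[str]:
    """Return the matched group_ids, minus docs that belong to a matched doc.

    matched   -- group_ids of the docs found by the user query (assumed to be
                 the same set on the 'from' side and in the main collection)
    member_of -- group_id -> list of group_member_id values (multi-valued,
                 since a doc can belong to more than one doc)
    """
    kept = []
    for group_id in sorted(matched):
        parents = member_of.get(group_id, [])
        # The join selects docs whose group_member_id equals the group_id of
        # a matched doc; the leading '-' on the fq then removes those docs.
        if any(parent in matched for parent in parents):
            continue
        kept.append(group_id)
    return kept


# The two example tables from this thread ('-' modelled as an empty list).
original = {"A": ["C"], "B": [], "C": [], "D": ["C"], "E": ["B"], "F": []}
extended = {"A": ["C"], "B": [], "C": [], "D": ["G"], "E": ["B"], "F": [], "G": []}
hits = {"A", "B", "C", "D", "E", "F"}  # the query does not find G

print(apply_negative_join(hits, original))  # ['B', 'C', 'F']
print(apply_negative_join(hits, extended))  # ['B', 'C', 'D', 'F']
```

In the extended table, D is kept because the doc it belongs to (G) is not
matched by the query, which is why we would return 'D'.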


On Thu, Jun 15, 2023 at 5:57 AM Mikhail Khludnev <m...@apache.org> wrote:

> Thanks for the clarification, Ron.
> Why is the membership extracted into a separate index?
> Join is heavy anyway, but running it cross-core is even heavier.
>
> The example you give is not really specific. I can implement it via
> fq=-group_member_id:*
>
> Let's extend it
> doc#   group_id   group_member_id
> 1      A          C
> 2      B          -
> 3      C          -
> 4      D          *G*
> 5      E          B
> 6      F          -
> 7      G
>
> So, if a user runs a query that finds docs A,B,C,D,E,F (not G),
> should it return D?
>
>
> On Thu, Jun 15, 2023 at 6:01 AM Ron Haines <mickr...@gmail.com> wrote:
>
> > adding more context as to why we are using the 'join'.
> >
> > We have a collection of documents where all documents have a 'group_id'
> > (which is essentially the doc's id).  And, some docs have a
> > 'group_member_id' that indicates if that doc belongs to a 'group_id'.
> > For example:
> >
> > doc#   group_id   group_member_id
> > 1      A          C
> > 2      B          -
> > 3      C          -
> > 4      D          C
> > 5      E          B
> > 6      F          -
> >
> > So, if a user runs a query that finds docs A,B,C,D,E,F we do not want to
> > include any of the documents that belong to any of the group_id's.  So,
> > for this search we really want a result count of 3 (docs B, C, F).
> > We want to exclude:
> > A because it belongs to C
> > D because it belongs to C
> > E because it belongs to B
> >
> > This negative 'join' &fq is how we are excluding these docs.  Note that a
> > document can 'belong' to more than 1 document.  So, yes, it does affect
> > the result count, if that was a question.
> >
> > Thanks for the suggestions.  I still have to run the test with the
> > 'method=topLevelDV', and I will pursue getting thread dumps.  Thx.  More
> > to come....
> >
> > On Wed, Jun 14, 2023 at 4:26 PM Mikhail Khludnev <m...@apache.org>
> > wrote:
> >
> > > Note: images are shredded in the mailing list.
> > > Well, if we apply a heavy operation (join), it's reasonable that it warms
> > > the CPU. It should impact the number of results. Does it?
> > > Overall, the usage seems non-typical: the query looks like role-based
> > > access control (or a group membership problem), but has dismax as a
> > > sub-query. Can't docs be remodelled somehow in a more efficient manner?
> > > It's worth understanding what keeps the CPU busy; usually a few thread
> > > dumps under load give a useful clue.
> > > Also, if the "to" side is huge and highly sharded, the "from" side is
> > > small, and updates are rare, an index-time join via {!parent} may work
> > > well. Caveat: it may be cumbersome.
> > > PS: I suggested two jiras earlier; I don't think they are applicable
> > > here.
> > >
> > > On Wed, Jun 14, 2023 at 8:26 PM Ron Haines <mickr...@gmail.com> wrote:
> > >
> > > > Fyi, I am finally getting back to this.  I apologize for the delay.
> > > >
> > > >
> > > >
> > > > I am going to try using the ‘method=topLevelDV’ option to see if that
> > > > makes a difference.  I will run the same tests used below, and follow up
> > > > with results.
> > > >
> > > >
> > > >
> > > > As far as more details about this scenario:
> > > >
> > > >    - Per the ‘user query’: some of them are quite simple, edismax,
> > > >    q=Maricopa county ethel
> > > >    - From a content point of view, updates are not happening very
> > > >    frequently.  We typically get batches of updates spread out over the
> > > >    course of the day.
> > > >    - Not quite sure what you are asking for per the 'collection
> > > >    definitions'.  The main collection is about 27 million docs, across
> > > >    96 shards, 2 replicas.  The fromIndex 'join' collection is quite
> > > >    small...about 80k docs, single shard, but replicated across the 96
> > > >    shards.
> > > >    - In the table below are the qtimes and response times, run both
> > > >    with and without the ‘join’.  Also have resultCount, for reference.
> > > >    - It is a small test sample of 12 queries, single-threaded.
> > > >       - Note, the qtimes, on average, for this small query set, increase
> > > >       about 40% with the join
> > > >
> > > >
> > > > search_qtime  responseTime  search_qtime  responseTime  resultCount
> > > > (no join)     (no join)     (with join)   (with join)
> > > > 1748          3179          2834          4292          471894
> > > > 1557          2865          1794          3108          332
> > > > 929           2278          1261          2654          541282
> > > > 813           2107          1036          2322          15347
> > > > 413           1730          539           1838          42
> > > > 388           1725          678           2027          313
> > > > 1095          2481          1453          2821          435627
> > > > 829           2263          1310          2739          299
> > > > 838           2103          1081          2358          86049
> > > > 1236          2610          1911          3283          77881
> > > > 950           2274          1313          2661          15160
> > > > 763           2066          885           2184          738
> > > >
> > > > What is most concerning is the cpu increase that we see in Solr.  Here
> > > > is a more 'concurrent' test, at about 12 qps, but it is not at a 'full'
> > > > load...maybe 50%.  This test 'held up', meaning we did not get into any
> > > > trouble.
> > > >
> > > >
> > > > Hope these images come thru...but, here is a cpu profile for a 1 hour
> > > > test with no 'join' being used,
> > > >
> > > >
> > > > [image: image.png]
> > > >
> > > > And, here is the same 1 hour test, using the 'join', run twice.  Note
> > > > the difference in 'scale' of cpu of these 2 tests vs. the one above,
> > > > from a 'cores' point of view:
> > > > [image: image.png]
> > > >
> > > > Like I said, I'll run these same tests with the ‘method=topLevelDV’,
> > > > and see if it changes behavior.
> > > >
> > > > Thx
> > > >
> > > > Ron Haines
> > > >
> > > > On Thu, May 25, 2023 at 4:29 PM Mikhail Khludnev <m...@apache.org>
> > > > wrote:
> > > >
> > > > >> Ron, how often are both indices updated? Presumably if they are
> > > > >> static, the filter cache may help.
> > > > >> It's worth making sure that the app gives the filter cache a chance.
> > > > >> To better understand the problem it is worth taking a few thread dumps
> > > > >> under load: a deep stack gives a clue to the hotspot (or just take a
> > > > >> sampling profile). Once we know the hot spot we can think about a
> > > > >> workaround.
> > > > >> https://issues.apache.org/jira/browse/SOLR-16717 about sharding
> > > > >> "fromIndex"
> > > > >> https://issues.apache.org/jira/browse/SOLR-16242 about keeping
> > > > >> "local/to" index cache when fromIndex is updated.
> > > >>
> > > > >> On Thu, May 25, 2023 at 5:01 PM Andy Lester <a...@petdance.com>
> > > > >> wrote:
> > > >>
> > > >> >
> > > >> >
> > > > >> > > On May 25, 2023, at 7:51 AM, Ron Haines <mickr...@gmail.com>
> > > > >> > > wrote:
> > > >> > >
> > > > >> > > So, when this feature is enabled, this negative &fq gets added:
> > > > >> > > -{!join fromIndex=primary_rollup from=group_id_mv to=group_member_id score=none}${q}
> > > >> >
> > > >> >
> > > > >> > Can we see collection definitions of both the source collection and
> > > > >> > the join? Also, a sample query, not just the one parameter? Also, how
> > > > >> > often are either of these collections updated? One thing that killed
> > > > >> > off an entire project that we were doing was that the join table was
> > > > >> > getting updated about once a minute, and this destroyed all our
> > > > >> > caching, and made the queries we wanted to do unusable.
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> > Andy
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Sincerely yours
> > > >> Mikhail Khludnev
> > > >>
> > > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
