Adding more context on why we are using the 'join'.

We have a collection in which every document has a 'group_id' (which is
essentially the doc's id).  Some docs also have a 'group_member_id', which
indicates that the doc belongs to the doc with that 'group_id'.  For
example:

doc#   group_id   group_member_id
1      A          C
2      B          -
3      C          -
4      D          C
5      E          B
6      F          -

So, if a user runs a query that finds docs A, B, C, D, E, and F, we do not
want to include any document that belongs to one of those group_ids.  For
this search we really want a result count of 3 (docs B, C, F).
We want to exclude:
A because it belongs to C
D because it belongs to C
E because it belongs to B

This negative 'join' &fq is how we are excluding these docs.  Note that a
document can 'belong' to more than one document.  So, yes, the join does
affect the result count, if that was the question.
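
To make the intended behavior concrete, here is a minimal Python sketch of
the exclusion applied to the example table above.  It is a toy model, not
how Solr executes the join: it assumes a single in-memory list stands in for
both collections, and the function name apply_negative_join is made up for
illustration.

```python
# Toy model of the negative join filter, using the example table above.
# Assumption: one in-memory list stands in for both the main collection
# and the fromIndex collection; Solr itself works very differently.
docs = [
    {"group_id": "A", "group_member_id": ["C"]},
    {"group_id": "B", "group_member_id": []},
    {"group_id": "C", "group_member_id": []},
    {"group_id": "D", "group_member_id": ["C"]},
    {"group_id": "E", "group_member_id": ["B"]},
    {"group_id": "F", "group_member_id": []},
]


def apply_negative_join(matched):
    """Drop every matched doc that belongs to a group_id that also matched.

    group_member_id is a list because a doc can belong to more than one doc.
    """
    matched_ids = {d["group_id"] for d in matched}
    return [
        d for d in matched
        if not matched_ids.intersection(d["group_member_id"])
    ]


kept = apply_negative_join(docs)
print([d["group_id"] for d in kept])  # ['B', 'C', 'F']
```

A, D, and E are dropped because their group_member_id points at a group_id
that is also in the matched set, which leaves the expected count of 3.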

Thanks for the suggestions.  I still have to run the test with
'method=topLevelDV', and I will pursue getting thread dumps.  Thx.  More to
come...
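
For reference, a rough sketch of how the filter might look with the method
added.  The field and collection names come from the &fq quoted further down
in this thread; the helper build_join_fq is hypothetical, and whether
'method=topLevelDV' is accepted for this fromIndex setup (Solr version,
docValues on both join fields) is an assumption I still have to verify in
the test.

```python
from urllib.parse import urlencode


def build_join_fq(method=None):
    """Build the negative join fq; optionally add a 'method' local param.

    Hypothetical helper for illustration only.
    """
    local_params = [
        "fromIndex=primary_rollup",
        "from=group_id_mv",
        "to=group_member_id",
        "score=none",
    ]
    if method:
        local_params.append("method=" + method)
    # ${q} is left as-is so Solr can substitute the user query.
    return "-{!join " + " ".join(local_params) + "}${q}"


params = {
    "q": "Maricopa county ethel",  # sample user query from this thread
    "defType": "edismax",
    "fq": build_join_fq("topLevelDV"),
}
print(params["fq"])
print(urlencode(params))  # URL-encoded query string for the request
```

Calling build_join_fq() with no argument gives the filter exactly as we run
it today, so the two variants can be compared in the same test.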

On Wed, Jun 14, 2023 at 4:26 PM Mikhail Khludnev <m...@apache.org> wrote:

> Note: images are shredded in the mailing list.
> Well, if we apply a heavy operation (join), it's reasonable that it heats
> up the CPU. It should affect the number of results. Does it?
> Overall, the usage seems non-typical: query looks like role based access
> control (or group membership problem), but has dismax as a sub-query. Can't
> docs be remodelled somehow in a more efficient manner?
> It's worth understanding what keeps the CPU busy; usually a few thread dumps
> under load give a useful clue.
> Also, if the "to" side is huge and highly sharded, the "from" side is small,
> and updates are rare, an index-time join via {!parent} may work well.
> Caveat: it may be cumbersome.
> PS, I suggested two jiras earlier, I don't think they are applicable here.
>
> On Wed, Jun 14, 2023 at 8:26 PM Ron Haines <mickr...@gmail.com> wrote:
>
> > Fyi, I am finally getting back to this.  I apologize for the delay.
> >
> >
> >
> > I am going to try using the ‘method=topLevelDV’ option to see if that
> > makes a difference.  I will run the same tests used below, and follow up
> > with results.
> >
> >
> >
> > As far as more details about this scenario:
> >
> >    - Per the ‘user query’.  Some of them are quite simple, edismax,
> >    q=Maricopa county ethel
> >    - from a content point of view, updates are not happening very
> >    frequently.  We typically get batches of updates spread out over the
> >    course of the day.
> >    - not quite sure what you are asking for per the 'collection
> >    definitions'.  The main collection is about 27 million docs, across 96
> >    shards, 2 replicas. The fromIndex 'join' collection is quite small:
> >    about 80k docs, single shard, but replicated across the 96 shards.
> >    - in the table below are the qtimes and response times, run both
> >    with and without the ‘join’.  resultCount is included for reference.
> >    - it is a small test sample of 12 queries, single-threaded,
> >       - Note, the qtimes, on average for this small query set, increase
> >       about 40% with the join
> >
> >
> > search_qtime   responseTime   search_qtime   responseTime   resultCount
> > (no join)      (no join)      (with join)    (with join)
> > 1748           3179           2834           4292           471894
> > 1557           2865           1794           3108           332
> > 929            2278           1261           2654           541282
> > 813            2107           1036           2322           15347
> > 413            1730           539            1838           42
> > 388            1725           678            2027           313
> > 1095           2481           1453           2821           435627
> > 829            2263           1310           2739           299
> > 838            2103           1081           2358           86049
> > 1236           2610           1911           3283           77881
> > 950            2274           1313           2661           15160
> > 763            2066           885            2184           738
> >
> > What is most concerning is the cpu increase that we see in Solr.  Here
> > is a more 'concurrent' test, at about 12 qps, but it is not at a 'full'
> > load...maybe 50%.  This test 'held up', meaning we did not get into any
> > trouble.
> >
> >
> > Hope these images come thru...but, here is a cpu profile for a 1 hour
> > test with no 'join' being used,
> >
> >
> > [image: image.png]
> >
> > And, here is the same 1 hour test, using the 'join', run twice.  Note the
> > difference in 'scale' of cpu of these 2 tests vs. the one above, from a
> > 'cores' point of view:
> > [image: image.png]
> >
> > Like I said, I'll run these same tests with the ‘method=topLevelDV’, and
> > see if it changes behavior.
> >
> > Thx
> >
> > Ron Haines
> >
> > On Thu, May 25, 2023 at 4:29 PM Mikhail Khludnev <m...@apache.org> wrote:
> >
> >> Ron, how often are both indices updated? Presumably, if they are static,
> >> the filter cache may help.
> >> It's worth making sure that the app gives the filter cache a chance.
> >> To better understand the problem, it is worth taking a few thread dumps
> >> under load: a deep stack gives a clue to the hotspot (or just take a
> >> sampling profile). Once we know the hot spot we can think about a
> >> workaround.
> >> https://issues.apache.org/jira/browse/SOLR-16717 about sharding
> >> "fromIndex"
> >> https://issues.apache.org/jira/browse/SOLR-16242 about keeping the
> >> "local/to" index cache when fromIndex is updated.
> >>
> >> On Thu, May 25, 2023 at 5:01 PM Andy Lester <a...@petdance.com> wrote:
> >>
> >> >
> >> >
> >> > > On May 25, 2023, at 7:51 AM, Ron Haines <mickr...@gmail.com> wrote:
> >> > >
> >> > > So, when this feature is enabled, this negative &fq gets added:
> >> > > -{!join fromIndex=primary_rollup from=group_id_mv to=group_member_id
> >> > > score=none}${q}
> >> >
> >> >
> >> > Can we see collection definitions of both the source collection and
> >> > the join? Also, a sample query, not just the one parameter? Also, how
> >> > often are either of these collections updated? One thing that killed
> >> > off an entire project that we were doing was that the join table was
> >> > getting updated about once a minute, and this destroyed all our
> >> > caching, and made the queries we wanted to do unusable.
> >> >
> >> >
> >> > Thanks,
> >> > Andy
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >>
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>
