Unfortunately, Solr right now doesn't have a great answer for the kind of
bulk extract with scoring that you're doing. The export handler is designed
for bulk extract but doesn't score; the select handler is designed for top-N
retrieval with scoring. I'm surprised that single-shard bulk extract with
scoring performs as well as it does.
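
To make the difference concrete, here's a rough sketch (field names borrowed
from your expressions and queries simplified, so treat it as illustrative
only). With the select handler you get scores, but the cost grows quickly as
you increase rows:

search(person, q="SmartSearchS:madrid", fl="PersonIDDoc,score",
       sort="score desc", rows="1000")

With the export handler you can stream the full sorted result set with flat
memory use, but score is not available in fl or sort:

search(person, qt="/export", q="SmartSearchS:madrid", fl="PersonIDDoc",
       sort="PersonIDDoc desc")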


Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, May 10, 2023 at 10:34 AM Sergio García Maroto <marot...@gmail.com>
wrote:

> Thanks Joel for your answer.
>
> Actually, what I need is to return scores from different collections and
> do some calculations on those scores to ultimately retrieve people.
> Let me show you a more complex sample. This is really fast with all
> collections on the same servers, but very slow once sharding comes into
> play.
>
> select(
>   select(
>     rollup(
>       innerJoin(
>         select(
>           search(person,
>             q="((((SmartSearchS:"madrid [$CU] [$PRJ] [$REC] "~100)^4 OR (SmartSearchS:"madrid [$CU] [$PRJ] [$RECL] "~100)^3 OR (SmartSearchS:"madrid [$CU] [$PRJ] "~100)^2) OR (((SmartSearchS:(madrid*))) OR ((SmartSearchS:("madrid")))^3)) AND ((*:* -StatusSFD:("\*\*\*System Delete\*\*\*")) AND type_level:(parent)))",
>             fl="PersonIDDoc,score",
>             sort="PersonIDDoc desc,score desc",
>             rows="2147483647"),
>           PersonIDDoc,
>           score as PersonScore),
>         select(
>           rollup(
>             search(contactupdate,
>               q="(((UpdateTextS:(cto)) AND IsAttachToPerson:(true)) AND (((AssignConfidentialS:(false) OR (AssignAuthorisedEmplIdsS:((cc)))))))",
>               fl="PersonIDDoc,score",
>               sort="PersonIDDoc desc,score desc",
>               rows="2147483647"),
>             over=PersonIDDoc,
>             sum(score),
>             sum(PersonIDDoc)),
>           PersonIDDoc,
>           add(0,0) as DocumetScore,
>           sum(score) as ContactUpdateScore),
>         on="PersonIDDoc"),
>       over=PersonIDDoc,
>       sum(PersonScore),
>       sum(ContactUpdateScore),
>       count(PersonIDDoc)),
>     PersonIDDoc,
>     sum(ContactUpdateScore) as ContactUpdateScore,
>     sum(PersonScore) as PersonScoreTotal,
>     count(PersonIDDoc) as TokenCount),
>   PersonIDDoc,
>   TokenCount,
>   add(mult(PersonScoreTotal,0.6),mult(ContactUpdateScore,0.4)) as CalculatedScore)
>
>
> I know it's a pretty complex query, but it takes about 3 seconds on a
> single shard and basically explodes with sharding.
>
> Does this mean I can't achieve this behaviour with sharded collections?
>
> Regards
>
>
>
> On Wed, 10 May 2023 at 16:17, Joel Bernstein <joels...@gmail.com> wrote:
>
> > So the first thing I see is that you're doing a search using the select
> > handler, which is required in order to sort by score. In this scenario you
> > will run into deep paging issues as you increase the number of rows, which
> > will affect both memory and performance. A search using the export handler
> > will improve throughput as you add shards, without any memory penalty, but
> > it doesn't support scoring.
> >
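> > If you do need the full joined result set, the usual way to make joins
> > scale as you add shards is to partition the streams and push the join to
> > workers with parallel() and partitionKeys. A rough sketch only (your
> > field names, simplified queries, export handler, so no scores; the
> > worker count is a placeholder):
> >
> > parallel(person,
> >   innerJoin(
> >     search(person, qt="/export", q="type_level:parent",
> >            fl="PersonIDDoc", sort="PersonIDDoc desc",
> >            partitionKeys="PersonIDDoc"),
> >     search(contactupdate, qt="/export", q="IsAttachToPerson:true",
> >            fl="PersonIDDoc", sort="PersonIDDoc desc",
> >            partitionKeys="PersonIDDoc"),
> >     on="PersonIDDoc"),
> >   workers="2", sort="PersonIDDoc desc")
> >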
> > In this case 1000 rows is not that many docs, so I'm surprised the penalty
> > is so high. But you will definitely run into large memory and performance
> > penalties if you start pulling larger result sets.
> >
> > Can you describe the exact use case you need to accomplish? For example, do
> > you need to extract a large number of documents by joining streams of
> > scored data? Or can you display just the top N documents of the joined
> > streams?
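> >
> > If the top N is enough, you could wrap the join in the top() decorator so
> > only N tuples come back to the client. A rough sketch along the lines of
> > your expressions (queries simplified, n and the rows caps are
> > placeholders):
> >
> > top(n=100,
> >     innerJoin(
> >       select(search(person, q="SmartSearchS:france",
> >                     fl="PersonIDDoc,score", sort="PersonIDDoc desc",
> >                     rows="1000000"),
> >              PersonIDDoc, score as PersonScore),
> >       select(search(contactupdate, q="UpdateTextS:cto",
> >                     fl="PersonIDDoc,score", sort="PersonIDDoc desc",
> >                     rows="1000000"),
> >              PersonIDDoc, score as ContactScore),
> >       on="PersonIDDoc"),
> >     sort="PersonScore desc")
> >
> > That doesn't reduce the work the join itself does, but it caps what gets
> > returned.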
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Wed, May 10, 2023 at 6:47 AM Sergio García Maroto <marot...@gmail.com>
> > wrote:
> >
> > > Sure. Let's start with the simplest stream expression.
> > > This one only targets the person collection.
> > >
> > > *Stream Expression:*
> > > search(person,
> > >   q="((((SmartSearchS:"france [$CU] [$PRJ] [$REC] "~100)^4 OR (SmartSearchS:"france [$CU] [$PRJ] [$RECL] "~100)^3 OR (SmartSearchS:"france [$CU] [$PRJ] "~100)^2) OR (((SmartSearchS:(france*))) OR ((SmartSearchS:("france")))^3)) AND ((*:* -StatusSFD:("\*\*\*System Delete\*\*\*")) AND type_level:(parent)))",
> > >   fl="PersonIDDoc,score",
> > >   sort="score desc,PersonIDDoc desc",
> > >   rows="1000")
> > >
> > > *Schema*
> > > <field name="PersonIDDoc" type="string" indexed="true" stored="true" docValues="true" />
> > >
> > > *No sharding*
> > > *1 shard, 45.38GB, 64,348,740 docs*
> > > stream expression time: 660 ms
> > >
> > > *Sharding*
> > > *2 shards, 23GB each*
> > > stream expression time: 4000 ms
> > >
> > >
> > >
> > > On Wed, 10 May 2023 at 04:45, Joel Bernstein <joels...@gmail.com> wrote:
> > >
> > > > Can you share the expressions? Then we can discuss where the sharding
> > > > comes into play.
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > >
> > > > On Tue, May 9, 2023 at 1:17 PM Sergio García Maroto <marot...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am currently working on implementing sharding on our Solr Cloud
> > > > > cluster. The main idea is to be able to scale horizontally.
> > > > >
> > > > > At the moment, without sharding, we have all collections sitting on
> > > > > all servers. We also have some pretty heavy streaming expressions
> > > > > returning many ids, on average 300,000 ids to join.
> > > > >
> > > > > After sharding, I see a huge increase in CPU and memory usage,
> > > > > making queries way slower compared to not sharding.
> > > > >
> > > > > I guess that's expected, because the joins need to send data across
> > > > > servers over the network.
> > > > >
> > > > > Any thoughts on best practices here? I guess a possible approach is
> > > > > to split the shards further.
> > > > >
> > > > > Regards
> > > > > Sergio
> > > > >
> > > >
> > >
> >
>
