Thanks Joel for your answer.

Actually, what I need is to return scores from different collections and
do some calculations on those scores to retrieve people at the end.
Let me show you a more complex sample. This is really fast when all
collections are on the same servers, but very slow once sharding comes
into play.

select(
  select(
    rollup(
      innerJoin(
        select(
          search(person,
            q="((((SmartSearchS:"madrid [$CU] [$PRJ] [$REC] "~100)^4 OR (SmartSearchS:"madrid [$CU] [$PRJ] [$RECL] "~100)^3 OR (SmartSearchS:"madrid [$CU] [$PRJ] "~100)^2) OR (((SmartSearchS:(madrid*))) OR ((SmartSearchS:("madrid")))^3)) AND ((*:* -StatusSFD:("\*\*\*System Delete\*\*\*")) AND type_level:(parent)))",
            fl="PersonIDDoc,score",
            sort="PersonIDDoc desc,score desc",
            rows="2147483647"),
          PersonIDDoc,
          score as PersonScore),
        select(
          rollup(
            search(contactupdate,
              q="(((UpdateTextS:(cto)) AND IsAttachToPerson:(true)) AND (((AssignConfidentialS:(false) OR (AssignAuthorisedEmplIdsS:((cc)))))))",
              fl="PersonIDDoc,score",
              sort="PersonIDDoc desc,score desc",
              rows="2147483647"),
            over=PersonIDDoc,
            sum(score),
            sum(PersonIDDoc)),
          PersonIDDoc,
          add(0,0) as DocumetScore,
          sum(score) as ContactUpdateScore),
        on="PersonIDDoc"),
      over=PersonIDDoc,
      sum(PersonScore),
      sum(ContactUpdateScore),
      count(PersonIDDoc)),
    PersonIDDoc,
    sum(ContactUpdateScore) as ContactUpdateScore,
    sum(PersonScore) as PersonScoreTotal,
    count(PersonIDDoc) as TokenCount),
  PersonIDDoc,
  TokenCount,
  add(mult(PersonScoreTotal,0.6),mult(ContactUpdateScore,0.4)) as CalculatedScore)
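For what it's worth, the scoring the expression computes boils down to the
following minimal Python sketch. The sample rows are made up, and the two
rollups are collapsed into one pass for brevity; only the field names and
the 0.6/0.4 weights come from the expression above.

```python
from collections import defaultdict

# Hypothetical (PersonIDDoc, score) rows, as the two inner search
# streams would emit them. Real data comes from Solr, of course.
person_rows = [("p1", 4.0), ("p2", 2.5)]
contactupdate_rows = [("p1", 1.0), ("p1", 0.5), ("p2", 2.0)]

# rollup(over=PersonIDDoc, sum(score)) on the contactupdate stream:
contact_scores = defaultdict(float)
for person_id, score in contactupdate_rows:
    contact_scores[person_id] += score

# innerJoin(on="PersonIDDoc"), then the weighted combine:
# CalculatedScore = 0.6 * PersonScoreTotal + 0.4 * ContactUpdateScore
results = {}
for person_id, person_score in person_rows:
    if person_id in contact_scores:
        results[person_id] = 0.6 * person_score + 0.4 * contact_scores[person_id]

print(results)
```

The point being: everything up to the final `select` has to stream the full,
sorted result sets to one place before the join and rollups can run, which is
why row counts matter so much once shards are involved.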


I know it's a pretty complex query, but it takes about 3 seconds on a single
shard and basically explodes with sharding.

Does that mean I can't achieve this behaviour with sharded collections?

Regards



On Wed, 10 May 2023 at 16:17, Joel Bernstein <joels...@gmail.com> wrote:

> So the first thing I see is that you're doing a search using the select
> handler, which is required to sort by score. So in this scenario you will
> run into deep paging issues as you increase the number of rows. This will
> effect both memory and performance. A search using the export handler will
> improve throughput as you add shards, without any memory penalty, but it
> doesn't support scoring.
>
> In this case 1000 rows is not that many docs, so I'm surprised the penalty
> is so high. But you will definitely run into large memory and performance
> penalties if you start pulling larger result sets.
>
> Can you describe the exact use case you need to accomplish? For example, do
> you need to extract a large number of documents by joining streams of
> scored data? Or can you display just the top N documents of the joined
> streams?
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Wed, May 10, 2023 at 6:47 AM Sergio García Maroto <marot...@gmail.com>
> wrote:
>
> > Sure. Let's start by the simplest stream expression.
> > This one only targets person collection.
> >
> > *Stream Expression:*
> > search(person, q="((((SmartSearchS:"france [$CU] [$PRJ] [$REC] "~100)^4
> OR
> > (SmartSearchS:"france [$CU] [$PRJ] [$RECL] "~100)^3 OR
> > (SmartSearchS:"france [$CU] [$PRJ] "~100)^2) OR
> (((SmartSearchS:(france*)))
> > OR ((SmartSearchS:("france")))^3)) AND ((*:* -StatusSFD:("\*\*\*System
> > Delete\*\*\*")) AND type_level:(parent)))", fl="PersonIDDoc,score",
> > sort="score desc,PersonIDDoc desc", rows="1000")
> >
> > *Schema*
> > <field name="PersonIDDoc" type="string" indexed="true" stored="true"
> > docValues="true" />
> >
> > *No sharding*
> > *1 shard, 45.38GB, with 64,348,740 docs*
> > stream expression time: 660 ms
> >
> > *Sharding*
> > *2 shards, 23GB each*
> > stream expression time: 4000 ms
> >
> >
> >
> > On Wed, 10 May 2023 at 04:45, Joel Bernstein <joels...@gmail.com> wrote:
> >
> > > Can you share the expressions? Then we can discuss where the sharding
> > comes
> > > into play.
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> > > On Tue, May 9, 2023 at 1:17 PM Sergio García Maroto <
> marot...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am currently working on implementing sharding on our Solr Cloud
> > > > cluster.
> > > > The main idea is to be able to scale horizontally.
> > > >
> > > > At the moment, without sharding, we have all collections sitting on
> > > > all servers.
> > > > We also have pretty heavy streaming expressions returning many IDs,
> > > > on average 300,000 IDs to join.
> > > >
> > > > After doing sharding I see a huge increase in CPU and memory usage,
> > > > making queries way slower compared to not sharding.
> > > >
> > > > I guess that's expected because the joins need to send data across
> > > > servers over the network.
> > > >
> > > > Any thoughts on best practices here? I guess a possible approach is
> > > > to split the shards further.
> > > >
> > > > Regards
> > > > Sergio
> > > >
> > >
> >
>