Re: Performance Considerations While Indexing Nested Documents: SolrV9.6.1

David Smiley Sat, 05 Apr 2025 11:05:11 -0700

If you want to improve the collapsing performance, you might use a numeric
supplier id to collapse on instead of a string.
You've denormalized things, which is typically the right trade-off.  As you
have observed, the down-side is that you have to reindex lots of stuff when
a related entity changes.
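
As a rough illustration of the first suggestion, here is a minimal Python
sketch that builds the request parameters for collapsing on a numeric field
instead of the string one. The field name supplier_id_i, the sample query
text, and the helper name are assumptions made for illustration; they are
not taken from the schema discussed in this thread. The field is assumed to
be a single-valued int field carrying the same supplier id.

```python
from urllib.parse import urlencode


def build_collapse_params(query, collapse_field="supplier_id_i"):
    """Return Solr request parameters that keep one document per supplier.

    collapse_field is a hypothetical single-valued int field holding the
    same supplier id that the string field holds today.
    """
    return {
        "q": query,
        # Same collapse filter as in the thread, pointed at the numeric field.
        "fq": "{!collapse field=%s}" % collapse_field,
    }


params = build_collapse_params("jute bags")
# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
```

Only the field referenced by the collapse filter changes; the rest of the
request stays as it is today.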


The limited support for in-place updates is an area where Lucene and Solr
could improve.  A well-resourced search team might invest in such a
project, as it is fairly within reach for someone with experience in
Lucene's internals.
Lucene supports updates to BinaryDocValues; Solr doesn't yet expose that.
If it did, you could put whatever data you want in there, encoded as you
like.  But BinaryDocValues are kind of useless for anything other than
retrieval (no sort, facet, relevance, filter).
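
For reference, the in-place updates that Solr does support today apply to
single-valued, non-indexed, non-stored numeric docValues fields, using the
atomic-update operations "set" and "inc". Below is a minimal Python sketch
of such an update document. The field name supplier_rating_dv and the
helper name are assumptions for illustration only; product-id is the unique
key mentioned in the thread.

```python
import json


def in_place_update_doc(product_id, field, value, op="set"):
    """Build one atomic-update document for Solr's JSON update handler.

    The update can be applied in place only if `field` is a single-valued,
    non-indexed, non-stored numeric docValues field.
    """
    if op not in ("set", "inc"):
        raise ValueError("in-place updates support only 'set' and 'inc'")
    return {"product-id": product_id, field: {op: value}}


# supplier_rating_dv is a hypothetical field name.
payload = json.dumps([in_place_update_doc("123", "supplier_rating_dv", 4)])
```

The payload would be posted to the collection's update handler like any
other atomic update.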

I would guess you are sharding by supplier (i.e. suppliers stay within a
shard), as you are collapsing on that.  Right?
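
The reason for asking: in SolrCloud, collapsing gives correct results only
when all documents sharing a collapse value live on the same shard. With
the implicit router mentioned in the thread, the client chooses the shard,
so one possible way to keep a supplier's products together is to derive the
shard from the supplier id. A minimal Python sketch follows; the shard
names (shard1 to shard62) and the modulo rule are assumptions, not the
poster's actual setup.

```python
def shard_for_supplier(supplier_id, num_shards=62):
    """Map a numeric supplier id to a shard name, deterministically."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Hypothetical naming scheme: shard1 .. shardN.
    return "shard%d" % (supplier_id % num_shards + 1)


def route_documents(docs, num_shards=62):
    """Group documents by target shard so each supplier stays on one shard."""
    batches = {}
    for doc in docs:
        shard = shard_for_supplier(doc["supplier-id"], num_shards)
        batches.setdefault(shard, []).append(doc)
    return batches


docs = [
    {"product-id": 123, "supplier-id": 678},
    {"product-id": 863, "supplier-id": 678},
]
# Each batch would then be sent with the _route_ parameter naming its shard.
batches = route_documents(docs)
```

Both example products of supplier 678 end up in the same batch, which is
the property the collapse query relies on.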

~ David

On Mon, Mar 24, 2025 at 1:50 PM Uday Kumar <uday.p...@indiamart.com.invalid>
wrote:

> Hi all,
>
> Any help here, wrt above mentioned details?
>
> *Thanks & Regards,*
> *Uday Kumar*
> *Product Search Tech*
>
> On Mon, Mar 17, 2025, 19:24 Uday Kumar <uday.p...@indiamart.com> wrote:
>
> > Hi,
> >
> >
> > *Please find details below:*
> > In our index, we have supplier data along with their products, which we
> > display on the front-end in response to search requests.
> >
> >
> >
> > *Example: For a supplier with id: 678, we have 2 products in our index*
> > *product-id(unique)*
> > *document1:*
> > {
> > product-id: 123
> > product-price: 2000rs
> > product-name: Jute bags
> >
> > *supplier-id: 678*
> > *company-name: BagFactoryLimited*
> > }
> >
> > *document2:*
> > {
> > product-id: 863
> > product-price: 4500rs
> > product-name: trolley bags
> >
> > *supplier-id: 678*
> > *company-name: BagFactoryLimited*
> > }
> >
> > As you can see from above, each document in our index contains
> > product details i.e product-id, product-price, product-name
> > and also supplier details i.e supplier-id, company-name
> >
> > *Problem1: (while indexing)*
> > Here, whenever a supplier-specific field changes, we re-index all of
> > that supplier's products, although the supplier data is identical in
> > every one of them.
> > *FYI*
> > We re-index ~5Cr (about 50 million) documents per day
> >
> >
> > *We would like to know whether there is a better way to optimize this
> > that avoids indexing redundant data*
> >
> > *Problem2: (while querying)*
> > When our current index is queried, we display only the single most
> > relevant product per supplier, even if the query matches several of
> > that supplier's documents.
> >
> > For this we use a collapse query on the supplier-id field (as the index
> > does not record the relationship between documents), which is resource
> > intensive.
> > *Ex: *
> > fq={!collapse field=supplier-id}
> >
> > *FYI*
> > We serve ~25 Lakh (about 2.5 million) queries per day
> >
> > *We would like to know if there is a better way to organize the index so
> > that we can avoid such resource-intensive queries and thereby improve
> > search response times*
> >
> > *Our Solr Infra Stats: FYI*
> > *Version:* v9.6.1
> > *No. of nodes:* 8
> > *No. of shards:* 62
> > *Heap per node: *12G
> > *RAM per node: *50G
> > *No. of cpu cores per node: *16
> > *Count of docs:* ~20Cr (about 200 million)
> > *Size of Index: *~250G
> > *Routing used:* implicit
> >
> > Please let us know, if you need any other details
> >
> > *Thanks & Regards,*
> > *Uday Kumar*
> >
> > On Wed, Mar 5, 2025 at 12:08 PM Alessandro Benedetti
> > <a.benede...@sease.io> wrote:
> >
> >> Hi Uday,
> >> Your email is a perfect example of
> >> https://en.m.wikipedia.org/wiki/XY_problem.
> >>
> >> Both for indexing and query time you need to explain your problems and
> >> use cases rather than your attempted solutions.
> >>
> >>
> >> Then we'll be able to give some recommendations.
> >>
> >>
> >> On Wed, 5 Mar 2025, 06:39 Uday Kumar, <uday.p...@indiamart.com.invalid>
> >> wrote:
> >>
> >> > Hi,
> >> > I would like to give some extra context here, so that it would help in
> >> > getting better suggestions
> >> >
> >> >
> >> > *Our goal: To improve our search system, either by optimizing
> >> > indexing or by improving Solr response times*
> >> >
> >> > *Current approach while indexing at our end:*
> >> > Even when a single field of a document changes, we send the entire
> >> > document for indexing (~2cr docs are reindexed daily).
> >> > Solr version: V9.6.1
> >> >
> >> > *To Optimize Indexing:*
> >> > 1. POC on external file field (which stores frequently changed fields
> >> > in an external file and loads them after each commit, instead of
> >> > indexing into Solr for each change):
> >> > https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html
> >> > Observation:
> >> > a. Works only with numeric fields.
> >> > b. The community also advised against it, as it is an old feature.
> >> > So, I dropped this.
> >> >
> >> > 2. POC on in-place updates (which index only the fields that changed,
> >> > not the entire document):
> >> > https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#in-place-updates
> >> > Observation:
> >> > a. Works only with single-valued fields.
> >> > b. Looks promising for indexing optimization but does not suit our
> >> > schema (as we have many multivalued fields). So, dropped.
> >> >
> >> >
> >> > Then we moved on to alternatives that are expected to help optimize
> >> > response times.
> >> >
> >> > *To improve Solr Response time:*
> >> > Nested Documents POC:
> >> > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html
> >> > *wrt this statement:*
> >> > "In terms of performance, *indexing the relationships between documents
> >> > usually yields much faster queries* than an equivalent "query time
> >> > join", since the relationships are already stored in the index and do
> >> > not need to be computed"
> >> >
> >> > But here we found that the complete block is reindexed even when a
> >> > single child document changes.
> >> > So, we would like to know more about this feature:
> >> > 1. Is this complete-block reindexing heavy compared with traditional
> >> > indexing? [We reindex many documents per day, i.e. ~2cr]
> >> > 2. What can we expect from the nested document feature in terms of
> >> > performance (wrt the trade-off between indexing and querying)?
> >> > 3. If not, is there any other alternative we can work on?
> >> >
> >> > *Thanks & Regards,*
> >> > *Uday Kumar*
> >> >
> >> >
> >> > On Mon, Mar 3, 2025 at 7:17 PM Uday Kumar <uday.p...@indiamart.com>
> >> > wrote:
> >> >
> >> > > Also, in-place updates happen only under very specific conditions.
> >> > > Have you checked that you satisfy them before even attempting to see
> >> > > some sort of impact on your use case?
> >> > > Yes, we considered those specifications. We didn't mean to say that
> >> > > it's not impactful in itself, only that it is not impactful with our
> >> > > project & schema.
> >> > >
> >> > > *Thanks & Regards,*
> >> > > *Uday Kumar*
> >> > > *Product Search Tech*
> >> > >
> >> > >
> >> > > On Fri, Feb 28, 2025 at 6:06 PM Alessandro Benedetti <
> >> > > benedetti.ale...@gmail.com> wrote:
> >> > >
> >> > >> What is your problem? Rather than asking about a solution you
> >> > >> attempted, it is usually better to start from the problem.
> >> > >>
> >> > >> You talk about grouping, have you considered field collapsing?
> >> > >>
> >> > >> In my experience, going with nested documents rarely justifies the
> >> > >> performance and functional overhead at both indexing and query time.
> >> > >>
> >> > >> But sometimes you need them.
> >> > >>
> >> > >> Also, in-place updates happen only under very specific conditions.
> >> > >> Have you checked that you satisfy them before even attempting to see
> >> > >> some sort of impact on your use case?
> >> > >>
> >> > >> Cheers
> >> > >>
> >> > >> On Fri, 28 Feb 2025, 08:30 Uday Kumar,
> >> > >> <uday.p...@indiamart.com.invalid> wrote:
> >> > >>
> >> > >> > Does this mean that using nested indexing in production will not
> >> > >> > be impactful for performance at such an indexing rate?
> >> > >> >
> >> > >> > We have tried a POC on in-place updates and found it is not
> >> > >> > impactful for our project either, so we would not use it in
> >> > >> > combination.
> >> > >> >
> >> > >> > *Thanks & Regards,*
> >> > >> > *Uday Kumar*
> >> > >> > *Product Search Tech*
> >> > >> >
> >> > >> >
> >> > >> > On Thu, Feb 27, 2025 at 12:31 PM Mikhail Khludnev
> >> > >> > <m...@apache.org> wrote:
> >> > >> >
> >> > >> > > Changing one child rewrites the whole block, period.
> >> > >> > > However, in-place updating of child docValues is promising in
> >> > >> > > theory, although I don't know how it works in practice.
> >> > >> > >
> >> > >> > > On Thu, Feb 27, 2025 at 8:05 AM Uday Kumar
> >> > >> > > <uday.p...@indiamart.com.invalid> wrote:
> >> > >> > >
> >> > >> > > > Hi all,
> >> > >> > > > We are doing a POC on indexing nested documents, expecting it
> >> > >> > > > to reduce grouping overhead at query time.
> >> > >> > > >
> >> > >> > > > In production indexing, we use the traditional approach of
> >> > >> > > > reindexing the entire document if any of its fields changes.
> >> > >> > > > [we reindex ~2cr documents per day, FYI]
> >> > >> > > > Solr Version: v9.6.1
> >> > >> > > >
> >> > >> > > > But I have come across a caution in the Solr documentation:
> >> > >> > > > *DOC*
> >> > >> > > > <https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#:~:text=By%20way%20of%20examples%3A%20nested,%2F%20colors)%20and%20supporting%20documentation%20(>,
> >> > >> > > > where it says: *Solr must internally reindex an entire nested
> >> > >> > > > document tree if there are updates to it.*
> >> > >> > > > This means that if a root or parent has 1000 child documents,
> >> > >> > > > a change to any one field of a single document causes all of
> >> > >> > > > the nested children to be reindexed, which is not good enough.
> >> > >> > > >
> >> > >> > > > This made us rethink the performance gains we would get if
> >> > >> > > > nested documents were used in production.
> >> > >> > > >
> >> > >> > > > If that's the case, please let us know if there are any other
> >> > >> > > > solutions that would help us gain performance.
> >> > >> > > >
> >> > >> > > > *Note:*
> >> > >> > > > We have already done POCs on external file fields and in-place
> >> > >> > > > updates, and found they are not impactful for our project.
> >> > >> > > >
> >> > >> > > > *Thanks & Regards,*
> >> > >> > > > *Uday Kumar*
> >> > >> > > >
> >> > >> > >
> >> > >> > >
> >> > >> > > --
> >> > >> > > Sincerely yours
> >> > >> > > Mikhail Khludnev
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
> >
>
