Hi,

*Please find the details below:*

In our index, we store supplier data along with each supplier's products,
which we display on the front end in response to search requests.



*Example: for the supplier with id 678, we have 2 products in our index*
*(product-id is the unique key)*
*document1:*
{
product-id: 123
product-price: 2000rs
product-name: Jute bags

*supplier-id: 678*
*company-name: BagFactoryLimited*
}

*document2:*
{
product-id: 863
product-price: 4500rs
product-name: trolley bags

*supplier-id: 678*
*company-name: BagFactoryLimited*
}

As you can see above, each document in our index contains
product details (product-id, product-price, product-name)
and also supplier details (supplier-id, company-name).

*Problem 1: (while indexing)*
Whenever a supplier-specific field changes, we re-index all of that
supplier's products, even though the supplier data is identical in every
one of those product documents.
*FYI*
We re-index ~5Cr (~50 million) documents per day.
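
To make the fan-out concrete, here is a minimal Python sketch of the denormalized layout described above. The field names follow the two example documents; the helper `docs_to_reindex` is hypothetical and only counts how many product documents embed a given supplier's fields. It does not talk to Solr.

```python
# Denormalized index: every product document repeats its supplier's fields.
index = [
    {"product-id": 123, "product-price": "2000rs", "product-name": "Jute bags",
     "supplier-id": 678, "company-name": "BagFactoryLimited"},
    {"product-id": 863, "product-price": "4500rs", "product-name": "trolley bags",
     "supplier-id": 678, "company-name": "BagFactoryLimited"},
]


def docs_to_reindex(index, supplier_id):
    """Return the product-ids whose documents embed this supplier's fields."""
    return [doc["product-id"] for doc in index if doc["supplier-id"] == supplier_id]


# Changing one supplier field (e.g. company-name) touches every product
# document of that supplier, because each one carries its own copy.
affected = docs_to_reindex(index, 678)
print(len(affected), "documents must be re-sent for one supplier change")
```

With only two products the cost is negligible, but the same one-to-many pattern applies to a supplier with thousands of products.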


*We would like to know if there is a better way to organize this, so that
we avoid indexing redundant data.*

*Problem 2: (while querying)*
When our current index is queried, we display only the single most relevant
product per supplier, even if the query matches more than one of that
supplier's documents.

For this we use a collapse query on the supplier-id field (as the index does
not record any relationship between a supplier's documents), which is
resource intensive.
*Ex: *
fq={!collapse field=supplier-id}
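
For readers who want to see the filter in a full request, the following is a minimal Python sketch that builds the select URL. The host, the collection name `products`, and the query term `bags` are assumptions for illustration only; the `fq` value is the one shown above.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; adjust to the real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_collapse_query(user_query, collapse_field="supplier-id"):
    """Build a select URL that collapses results to one document per supplier."""
    params = {
        "q": user_query,
        # Collapsing query parser, exactly as in the filter shown above.
        "fq": "{!collapse field=%s}" % collapse_field,
    }
    # urlencode percent-encodes the braces, "!" and "=" inside the fq value.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_collapse_query("bags")
print(url)
```

Encoding the parameters with `urlencode` avoids hand-escaping the local-params syntax when the URL is assembled in application code.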

*FYI*
We serve ~25 lakh (~2.5 million) queries per day.

*We would like to know if there is a better way to organize the index, so
that we can avoid such resource-intensive queries and thereby improve
search response times.*

*Our Solr infra stats (FYI):*
*Version:* v9.6.1
*No. of nodes:* 8
*No. of shards:* 62
*Heap per node:* 12G
*RAM per node:* 50G
*No. of CPU cores per node:* 16
*Count of docs:* ~20Cr
*Size of index:* ~250G
*Routing used:* implicit
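
As a rough back-of-the-envelope reading of these numbers, the Python sketch below converts the crore/lakh figures into per-second and per-shard averages. It assumes load is spread evenly over 24 hours and documents evenly over shards, and that ~250G is the total index size across all nodes; real peaks and skew will differ.

```python
CRORE = 10_000_000   # 1 crore = 10 million
LAKH = 100_000       # 1 lakh = 100 thousand
SECONDS_PER_DAY = 24 * 60 * 60

reindexed_per_day = 5 * CRORE   # ~5Cr documents re-indexed per day
queries_per_day = 25 * LAKH     # ~25 lakh queries per day
total_docs = 20 * CRORE         # ~20Cr documents in the index
index_size_gb = 250             # assumed to be the total across all nodes
shards = 62
nodes = 8

# Averages only: uniform load over the day, uniform docs per shard.
updates_per_second = reindexed_per_day / SECONDS_PER_DAY
queries_per_second = queries_per_day / SECONDS_PER_DAY
docs_per_shard = total_docs / shards
index_gb_per_node = index_size_gb / nodes
daily_churn = reindexed_per_day / total_docs  # fraction of the index rewritten daily

print(f"updates/s (avg): {updates_per_second:.0f}")
print(f"queries/s (avg): {queries_per_second:.0f}")
print(f"docs per shard:  {docs_per_shard:,.0f}")
print(f"index GB/node:   {index_gb_per_node}")
print(f"daily churn:     {daily_churn:.0%}")
```

On these assumptions the update rate is roughly twenty times the query rate, which is why the indexing side is raised as a problem in its own right.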

Please let us know if you need any other details.

*Thanks & Regards,*
*Uday Kumar*

On Wed, Mar 5, 2025 at 12:08 PM Alessandro Benedetti <a.benede...@sease.io>
wrote:

> Hi Uday,
> Your email is a perfect example of
> https://en.m.wikipedia.org/wiki/XY_problem.
>
> Both for indexing and query time you need to explain your problems and use
> cases rather than your attempted solutions.
>
>
> Then we'll be able to give some recommendations.
>
>
> On Wed, 5 Mar 2025, 06:39 Uday Kumar, <uday.p...@indiamart.com.invalid>
> wrote:
>
> > Hi,
> > I would like to give some extra context here, so that it would help in
> > getting better suggestions
> >
> >
> > *Our goal: to improve our search system, either by optimizing indexing or
> > by improving Solr response times*
> >
> > *Current approach while indexing at our end:*
> > Even when a single field of a document changes, we send the entire
> > document for indexing (~2cr docs are re-indexed daily).
> > Solr version: V9.6.1
> >
> > *To Optimize Indexing:*
> > 1. POC on external file fields (which store frequently changed fields in
> > an external file that is loaded after each commit, instead of indexing
> > into Solr for each change):
> >
> >
> https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html
> > Observation:
> > a. Works only with numeric fields.
> > b. The community also advised against it, as it is an old feature.
> > So I dropped this.
> >
> > 2. POC on in-place updates (which index only the fields that changed,
> > not the entire document):
> >
> >
> https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#in-place-updates
> > Observation:
> > a. Works only with single-valued fields.
> > b. Looks promising for indexing optimization but does not suit our
> > schema (as we have many multivalued fields), so we dropped it.
> >
> >
> > Then we moved on to alternatives that are expected to help improve
> > response times.
> >
> > *To improve Solr response time:*
> > Nested Documents POC:
> >
> >
> https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html
> > *wrt this statement:*
> > "In terms of performance, *indexing the relationships between documents
> > usually yields much faster queries* than an equivalent "query time join",
> > since the relationships are already stored in the index and do not need
> > to be computed"
> >
> > But here we found that the complete block is re-indexed even when a
> > single child document changes.
> > So we would like to know more about this feature:
> > 1. Is this complete-block re-indexing heavy compared with
> > traditional indexing? [We re-index many documents per day, i.e. ~2cr.]
> > 2. What can we expect from the nested document feature in terms of
> > performance (i.e. the trade-off between indexing and querying)?
> > 3. If it is not suitable, is there any other alternative we can work on?
> >
> > *Thanks & Regards,*
> > *Uday Kumar*
> >
> >
> > On Mon, Mar 3, 2025 at 7:17 PM Uday Kumar <uday.p...@indiamart.com>
> wrote:
> >
> > > Also in place updates happen on very specific conditions, have you
> > > checked you satisfy them before even attempting to see some sort of
> > > impact on your use case?
> > > Yes, we considered those conditions. We did not mean to say that it is
> > > not impactful in itself, only that it is not impactful with our project
> > > and schema.
> > >
> > > *Thanks & Regards,*
> > > *Uday Kumar*
> > > *Product Search Tech*
> > >
> > >
> > > On Fri, Feb 28, 2025 at 6:06 PM Alessandro Benedetti <
> > > benedetti.ale...@gmail.com> wrote:
> > >
> > >> What is your problem? Rather than asking about a solution you
> attempted
> > is
> > >> usually better to start from the problem.
> > >>
> > >> You talk about grouping, have you considered field collapsing?
> > >>
> > >> According to my experience going with nested documents rarely justify
> > the
> > >> performance and functional overhead both at indexing and query time.
> > >>
> > >> But sometimes you need them.
> > >>
> > >> Also in place updates happen on very specific conditions, have you
> > checked
> > >> you satisfy them before even attempting to see some sort of impact on
> > your
> > >> use case?
> > >>
> > >> Cheers
> > >>
> > >> On Fri, 28 Feb 2025, 08:30 Uday Kumar, <uday.p...@indiamart.com
> > .invalid>
> > >> wrote:
> > >>
> > >> > Does this mean it will not be impactful in performance to use Nested
> > >> > Indexing in production with such an indexing rate?
> > >> >
> > >> > We have tried POC on inplace updates and found its not impactful
> > either
> > >> wrt
> > >> > our project, so we would not be using this in combination too
> > >> >
> > >> > *Thanks & Regards,*
> > >> > *Uday Kumar*
> > >> > *Product Search Tech*
> > >> >
> > >> >
> > >> > On Thu, Feb 27, 2025 at 12:31 PM Mikhail Khludnev <m...@apache.org>
> > >> wrote:
> > >> >
> > >> > > Changing one child rewrites the whole block, period.
> > >> > > However in-place updating child docValues is promising in theory,
> > >> > although
> > >> > > I don't know how it works in practice.
> > >> > >
> > >> > > On Thu, Feb 27, 2025 at 8:05 AM Uday Kumar <
> uday.p...@indiamart.com
> > >> > > .invalid>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > > We are doing a POC on indexing nested documents in expectation
> of
> > >> > > reducing
> > >> > > > grouping overhead while querying time.
> > >> > > >
> > >> > > > On Prod Indexing, we are using the traditional approach of
> > >> reindexing
> > >> > the
> > >> > > > entire document if there is any change in any of the fields. [we
> > >> > reindex
> > >> > > > ~2cr documents per day, FYI]
> > >> > > > Solr Version: v9.6.1
> > >> > > >
> > >> > > > But I have come across a caution in solr documentation: *DOC
> > >> > > > <
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#:~:text=By%20way%20of%20examples%3A%20nested,%2F%20colors)%20and%20supporting%20documentation%20(
> > >> > > > >*,
> > >> > > > where it says: *Solr must internally reindex an entire nested
> > >> document
> > >> > > tree
> > >> > > > if there are updates to it.*
> > >> > > > Which means If a root or parent has 1000 child documents, even
> > with
> > >> a
> > >> > > > change in single document  in any one of the fields, entire
> nested
> > >> > childs
> > >> > > > are reindexed, which is not good enough.
> > >> > > >
> > >> > > > This made us rethink of performance gains that we will have, if
> > >> nested
> > >> > > > documents are used in production.
> > >> > > >
> > >> > > > If that's the case, pls let us know if there are any other
> > solutions
> > >> > > which
> > >> > > > would help us in performance gains.
> > >> > > >
> > >> > > > *Note:*
> > >> > > > We have already done POC on external file fields and In-Place
> > >> updates
> > >> > > where
> > >> > > > we found they are not impactful for our project.
> > >> > > >
> > >> > > > *Thanks & Regards,*
> > >> > > > *Uday Kumar*
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Sincerely yours
> > >> > > Mikhail Khludnev
> > >> > >
> > >> >
> > >>
> > >
> >
>
