Hi all,

Could anyone help with the details in my message quoted below?
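
For reference, the collapse request described under Problem2 in the quoted message can be sketched in Python roughly as follows. This is only a minimal sketch: the helper name build_collapse_params, the rows default and the sample query text are illustrative assumptions. Only the fq value comes from the quoted message.

```python
from urllib.parse import urlencode


def build_collapse_params(user_query, collapse_field="supplier-id", rows=10):
    """Build Solr query parameters that collapse results on one field.

    With the collapse query parser's default behaviour, Solr keeps the
    highest-scoring document of each group (here: one product per supplier).
    """
    return {
        "q": user_query,
        # Same filter as in the quoted message: fq={!collapse field=supplier-id}
        "fq": "{!collapse field=%s}" % collapse_field,
        "rows": rows,
    }


# Hypothetical user query, only for illustration.
params = build_collapse_params("jute bags")
# URL-encoded form, ready to append to a select request.
query_string = urlencode(params)
```

The sketch only builds the parameters. It does not contact Solr, so the collection name and request handler are left out on purpose.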

*Thanks & Regards,*
*Uday Kumar*
*Product Search Tech*

On Mon, Mar 17, 2025, 19:24 Uday Kumar <uday.p...@indiamart.com> wrote:

> Hi,
>
>
> *Please find details below:*
> In our index, we store supplier data along with their products, which we
> display on the front end in response to search requests.
>
>
>
> *Example: For a supplier with id 678, we have 2 products in our index*
> *(product-id is the unique key)*
> *document1:*
> {
> product-id: 123
> product-price: 2000rs
> product-name: Jute bags
>
> *supplier-id: 678*
> *company-name: BagFactoryLimited*
> }
>
> *document2:*
> {
> product-id: 863
> product-price: 4500rs
> product-name: trolley bags
>
> *supplier-id: 678*
> *company-name: BagFactoryLimited*
> }
>
> As you can see above, each document in our index contains
> product details, i.e. product-id, product-price, product-name,
> and also supplier details, i.e. supplier-id, company-name.
>
> *Problem1: (while indexing)*
> Whenever a supplier-specific field changes, we re-index all the products
> of that supplier, even though the supplier data is identical in every one
> of their product documents.
> *FYI*
> We re-index ~5Cr (~50 million) documents per day
>
>
> *We would like to know whether there is a better way to optimize this, so
> that we avoid indexing redundant data*
>
> *Problem2: (while querying)*
> When our current index is queried, we display only the single most
> relevant product of each supplier [even if the query matches more than one
> of that supplier's documents in our index].
>
> For this we use a collapse query on the supplier-id field (as the index
> holds no relationship between documents) [which is resource intensive]
> *Ex: *
> fq={!collapse field=supplier-id}
>
> *FYI*
> We serve ~25 Lakh (~2.5 million) queries per day
>
> *We would like to know whether there is a better way to organize the index,
> so that we can avoid such resource-intensive queries and thereby improve
> search response times*
>
> *Our Solr Infra Stats: FYI*
> *Version:* v9.6.1
> *No. of nodes:* 8
> *No. of shards:* 62
> *Heap per node: *12G
> *RAM per node: *50G
> *No. of cpu cores per node: *16
> *Count of docs:* ~20Cr
> *Size of Index: *~250G
> *Routing used:* implicit
>
> Please let us know if you need any other details.
>
> *Thanks & Regards,*
> *Uday Kumar*
>
> On Wed, Mar 5, 2025 at 12:08 PM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
>> Hi Uday,
>> Your email is a perfect example of
>> https://en.m.wikipedia.org/wiki/XY_problem.
>>
>> Both for indexing and query time you need to explain your problems and use
>> cases rather than your attempted solutions.
>>
>>
>> Then we'll be able to give some recommendations.
>>
>>
>> On Wed, 5 Mar 2025, 06:39 Uday Kumar, <uday.p...@indiamart.com.invalid>
>> wrote:
>>
>> > Hi,
>> > I would like to give some extra context here, so that it would help in
>> > getting better suggestions
>> >
>> >
>> > *Our goal: To improve our search system, either by optimizing indexing
>> > or by improving Solr response times*
>> >
>> > *Current approach while indexing at our end:*
>> > Even when only a single field of a document changes, we send the entire
>> > document for indexing. (~2cr docs are reindexed on a daily basis)
>> > Solr version: V9.6.1
>> >
>> > *To Optimize Indexing:*
>> > 1. POC on external file fields: [which store frequently changed field
>> > values in an external file that is loaded after each commit, instead of
>> > indexing into Solr for each change]
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html
>> > Observation:
>> > a. Works only with numeric fields
>> > b. The community also suggested not going with this, as it is an old
>> > feature. So, I dropped this.
>> >
>> > 2. POC on in-place updates: (which index only the fields that changed,
>> > not the entire document)
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#in-place-updates
>> > Observation:
>> > a. Works only with single-valued fields
>> > b. Looks promising for indexing optimization but is not suitable for our
>> > schema (as we have many multivalued fields). So, dropped.
>> >
>> >
>> > Then we moved on to alternatives that are expected to help optimize
>> > response times
>> >
>> > *To improve Solr Response time:*
>> > Nested Documents POC:
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html
>> > *wrt this statement:*
>> > "In terms of performance, *indexing the relationships between documents
>> > usually yields much faster queries* than an equivalent "query time join",
>> > since the relationships are already stored in the index and do not need
>> > to be computed"
>> >
>> > But we found that the complete block is reindexed even when a single
>> > child document changes.
>> > So, we would like to know more about this feature:
>> > 1. Is this complete-block reindexing heavy compared with traditional
>> > indexing? [We reindex many documents per day, i.e. ~2cr]
>> > 2. What can we expect from the nested document feature in terms of
>> > performance (wrt the tradeoff between indexing and querying)?
>> > 3. If it is not suitable, is there any other alternative we can work on?
>> >
>> > *Thanks & Regards,*
>> > *Uday Kumar*
>> >
>> >
>> > On Mon, Mar 3, 2025 at 7:17 PM Uday Kumar <uday.p...@indiamart.com>
>> wrote:
>> >
>> > > Also in place updates happen on very specific conditions, have you
>> > > checked you satisfy them before even attempting to see some sort of
>> > > impact on your use case?
>> > >
>> > > Yes, we considered those conditions. We didn't mean to say it's not
>> > > impactful in itself, only that it is not with our project and schema.
>> > >
>> > > *Thanks & Regards,*
>> > > *Uday Kumar*
>> > > *Product Search Tech*
>> > >
>> > >
>> > > On Fri, Feb 28, 2025 at 6:06 PM Alessandro Benedetti <
>> > > benedetti.ale...@gmail.com> wrote:
>> > >
>> > >> What is your problem? Rather than asking about a solution you
>> > >> attempted, it is usually better to start from the problem.
>> > >>
>> > >> You talk about grouping; have you considered field collapsing?
>> > >>
>> > >> In my experience, nested documents rarely justify the performance and
>> > >> functional overhead at both indexing and query time.
>> > >>
>> > >> But sometimes you need them.
>> > >>
>> > >> Also, in-place updates happen only under very specific conditions. Have
>> > >> you checked that you satisfy them before even attempting to see some
>> > >> sort of impact on your use case?
>> > >>
>> > >> Cheers
>> > >>
>> > >> On Fri, 28 Feb 2025, 08:30 Uday Kumar, <uday.p...@indiamart.com
>> > .invalid>
>> > >> wrote:
>> > >>
>> > >> > Does this mean that using nested indexing in production will not
>> > >> > help performance at such an indexing rate?
>> > >> >
>> > >> > We have tried a POC on in-place updates and found it's not impactful
>> > >> > for our project either, so we would not be using it in combination.
>> > >> >
>> > >> > *Thanks & Regards,*
>> > >> > *Uday Kumar*
>> > >> > *Product Search Tech*
>> > >> >
>> > >> >
>> > >> > On Thu, Feb 27, 2025 at 12:31 PM Mikhail Khludnev <m...@apache.org
>> >
>> > >> wrote:
>> > >> >
>> > >> > > Changing one child rewrites the whole block, period.
>> > >> > > However, in-place updating of child docValues is promising in
>> > >> > > theory, although I don't know how it works in practice.
>> > >> > >
>> > >> > > On Thu, Feb 27, 2025 at 8:05 AM Uday Kumar <
>> uday.p...@indiamart.com
>> > >> > > .invalid>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > > We are doing a POC on indexing nested documents, in the
>> > >> > > > expectation of reducing grouping overhead at query time.
>> > >> > > >
>> > >> > > > For indexing in production, we use the traditional approach of
>> > >> > > > reindexing the entire document whenever any field changes.
>> > >> > > > [we reindex ~2cr documents per day, FYI]
>> > >> > > > Solr Version: v9.6.1
>> > >> > > >
>> > >> > > > But I have come across a caution in the Solr documentation (*DOC*:
>> > >> > > > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#:~:text=By%20way%20of%20examples%3A%20nested,%2F%20colors)%20and%20supporting%20documentation%20(
>> > >> > > > ), where it says: *Solr must internally reindex an entire nested
>> > >> > > > document tree if there are updates to it.*
>> > >> > > > This means that if a root or parent has 1000 child documents, a
>> > >> > > > change to any one field of a single child causes all the nested
>> > >> > > > children to be reindexed, which is not good enough for us.
>> > >> > > >
>> > >> > > > This made us rethink the performance gains we would get if
>> > >> > > > nested documents were used in production.
>> > >> > > >
>> > >> > > > If that's the case, please let us know if there are any other
>> > >> > > > solutions that would help us gain performance.
>> > >> > > >
>> > >> > > > *Note:*
>> > >> > > > We have already done POCs on external file fields and in-place
>> > >> > > > updates, and found they are not impactful for our project.
>> > >> > > >
>> > >> > > > *Thanks & Regards,*
>> > >> > > > *Uday Kumar*
>> > >> > > >
>> > >> > >
>> > >> > >
>> > >> > > --
>> > >> > > Sincerely yours
>> > >> > > Mikhail Khludnev
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>
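
As a rough illustration of the partial-update option discussed earlier in the thread, the sketch below builds the JSON body for "set" updates of one supplier-level field across all of that supplier's products. It is only a sketch under stated assumptions: the helper name supplier_field_updates and the field supplier-rating are hypothetical, and product-id is assumed to be the unique key, as in the example documents. Whether Solr applies such an update in place depends on the schema (per the Solr guide, the field must be a single-valued, non-indexed, non-stored numeric docValues field). Otherwise it is handled as a regular atomic update.

```python
import json


def supplier_field_updates(product_ids, field, value):
    """Build one 'set' update document per product of a supplier."""
    return [{"product-id": pid, field: {"set": value}} for pid in product_ids]


# Product ids taken from the example documents of supplier 678;
# "supplier-rating" is a hypothetical numeric field used for illustration.
updates = supplier_field_updates(["123", "863"], "supplier-rating", 4.5)
# JSON body that could be sent to Solr's update handler.
body = json.dumps(updates)
```

Note that this still sends one update per product, so it reduces the per-document cost of a supplier change rather than removing the fan-out itself.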
