Hi all, any help here with respect to the details mentioned earlier?

*Thanks & Regards,*
*Uday Kumar*
*Product Search Tech*

On Mon, Mar 17, 2025, 19:24 Uday Kumar <uday.p...@indiamart.com> wrote:

> Hi,
>
> *Please find details below:*
> In our index, we have data of suppliers along with their products, which
> we display on the front-end in response to search requests.
>
> *Example: For a supplier with id: 678, we have 2 products in our index*
> *product-id (unique)*
> *document1:*
> {
> product-id: 123
> product-price: 2000rs
> product-name: Jute bags
> *supplier-id: 678*
> *company-name: BagFactoryLimited*
> }
>
> *document2:*
> {
> product-id: 863
> product-price: 4500rs
> product-name: trolley bags
> *supplier-id: 678*
> *company-name: BagFactoryLimited*
> }
>
> As you can see from above, each document in our index contains
> product details, i.e. product-id, product-price, product-name,
> and also supplier details, i.e. supplier-id, company-name.
>
> *Problem1: (while indexing)*
> Here, whenever there is a change in a supplier-specific detail/field, we
> re-index all the products of the supplier, although the supplier data
> will be the same in all of his products.
> *FYI*
> We re-index ~5Cr documents per day
>
> *We would like to know if there is any better way to optimize this which
> helps to avoid indexing of redundant data*
>
> *Problem2: (while querying)*
> Now, when the data in our current index is queried, we display the single
> most relevant product of a supplier [even if the query matches 1 or more
> documents in our index].
>
> For this we are using a collapse query on the supplier-id field (as we don't
> know the relationship between documents) [which is resource intensive]
> *Ex:*
> fq={!collapse field=supplier-id}
>
> *FYI*
> We serve ~25 Lakh Queries per day
>
> *We would like to know if there is any better way to organize the index, so
> that we can avoid such resource-intensive queries, thereby optimizing
> search response*
>
> *Our Solr Infra Stats: FYI*
> *Version:* v9.6.1
> *No. of nodes:* 8
> *No. of shards:* 62
> *Heap per node:* 12G
> *RAM per node:* 50G
> *No. of cpu cores per node:* 16
> *Count of docs:* ~20Cr
> *Size of Index:* ~250G
> *Routing used:* implicit
>
> Please let us know if you need any other details
>
> *Thanks & Regards,*
> *Uday Kumar*
>
> On Wed, Mar 5, 2025 at 12:08 PM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
>> Hi Uday,
>> Your email is a perfect example of
>> https://en.m.wikipedia.org/wiki/XY_problem.
>>
>> Both for indexing and query time you need to explain your problems and use
>> cases rather than your attempted solutions.
>>
>> Then we'll be able to give some recommendations.
>>
>> On Wed, 5 Mar 2025, 06:39 Uday Kumar, <uday.p...@indiamart.com.invalid>
>> wrote:
>>
>> > Hi,
>> > I would like to give some extra context here, so that it would help in
>> > getting better suggestions
>> >
>> > *Our goal: To improve our search system either by optimizing indexing or
>> > by improving solr response times*
>> >
>> > *Current approach while indexing at our end:*
>> > Even with a change in a single field of a document, we send the entire
>> > document for indexing. (~2cr docs are being reindexed on a daily basis)
>> > Solr version: V9.6.1
>> >
>> > *To Optimize Indexing:*
>> > 1. POC on external file field: [which stores frequently changed fields
>> > in an external file and loads after each commit, instead of indexing into
>> > solr for each change]
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html
>> > Observation:
>> > a. Works only with numeric fields
>> > b. Also the community suggested not to go with this, as it's an old
>> > feature. So, I dropped this.
>> >
>> > 2. POC on in-place update: (which helps in indexing only the fields that
>> > contain changes, not the entire document)
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#in-place-updates
>> > Observation:
>> > a. Works with only single-valued fields
>> > b. Looks promising wrt indexing optimization but not suitable wrt our
>> > schema (as we have more multivalued fields). So, dropped
>> >
>> > Then we moved to alternatives which are expected to help in optimizing
>> > response times
>> >
>> > *To improve Solr Response time:*
>> > Nested Documents POC:
>> >
>> > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html
>> > *wrt this statement:*
>> > "In terms of performance, *indexing the relationships between documents
>> > usually yields much faster queries* than an equivalent "query time
>> > join", since the relationships are already stored in the index and do not
>> > need to be computed"
>> >
>> > But here we found that the complete block will be reindexed even with a
>> > change in a single child document.
>> > So, we would like to know more about this feature:
>> > 1. Is this complete block reindexing heavy when compared with
>> > traditional indexing? [As we have more documents for reindexing per
>> > single day, i.e. ~2cr]
>> > 2. What can we expect from this nested document feature in terms of
>> > performance (wrt tradeoff in indexing/querying)?
>> > 3. If not, do we have any other alternative which we can work upon?
>> >
>> > *Thanks & Regards,*
>> > *Uday Kumar*
>> >
>> > On Mon, Mar 3, 2025 at 7:17 PM Uday Kumar <uday.p...@indiamart.com>
>> > wrote:
>> >
>> > > Also in place updates happen on very specific conditions, have you
>> > > checked you satisfy them before even attempting to see some sort of
>> > > impact on your use case?
>> > > Yes, we considered those specifications. Here, we didn't mean to say
>> > > it's not impactful in itself, but with our project & schema
>> > >
>> > > *Thanks & Regards,*
>> > > *Uday Kumar*
>> > > *Product Search Tech*
>> > >
>> > > On Fri, Feb 28, 2025 at 6:06 PM Alessandro Benedetti <
>> > > benedetti.ale...@gmail.com> wrote:
>> > >
>> > >> What is your problem? Rather than asking about a solution you
>> > >> attempted, it is usually better to start from the problem.
>> > >>
>> > >> You talk about grouping, have you considered field collapsing?
>> > >>
>> > >> In my experience, going with nested documents rarely justifies
>> > >> the performance and functional overhead both at indexing and query
>> > >> time.
>> > >>
>> > >> But sometimes you need them.
>> > >>
>> > >> Also in place updates happen on very specific conditions, have you
>> > >> checked you satisfy them before even attempting to see some sort of
>> > >> impact on your use case?
>> > >>
>> > >> Cheers
>> > >>
>> > >> On Fri, 28 Feb 2025, 08:30 Uday Kumar, <uday.p...@indiamart.com.invalid>
>> > >> wrote:
>> > >>
>> > >> > Does this mean it will not be impactful in performance to use Nested
>> > >> > Indexing in production with such an indexing rate?
>> > >> >
>> > >> > We have tried a POC on in-place updates and found it's not impactful
>> > >> > either wrt our project, so we would not be using this in combination
>> > >> > too
>> > >> >
>> > >> > *Thanks & Regards,*
>> > >> > *Uday Kumar*
>> > >> > *Product Search Tech*
>> > >> >
>> > >> > On Thu, Feb 27, 2025 at 12:31 PM Mikhail Khludnev <m...@apache.org>
>> > >> > wrote:
>> > >> >
>> > >> > > Changing one child rewrites the whole block, period.
>> > >> > > However, in-place updating child docValues is promising in theory,
>> > >> > > although I don't know how it works in practice.
>> > >> > >
>> > >> > > On Thu, Feb 27, 2025 at 8:05 AM Uday Kumar
>> > >> > > <uday.p...@indiamart.com.invalid> wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > > We are doing a POC on indexing nested documents in expectation of
>> > >> > > > reducing grouping overhead at query time.
>> > >> > > >
>> > >> > > > On Prod Indexing, we are using the traditional approach of
>> > >> > > > reindexing the entire document if there is any change in any of
>> > >> > > > the fields. [we reindex ~2cr documents per day, FYI]
>> > >> > > > Solr Version: v9.6.1
>> > >> > > >
>> > >> > > > But I have come across a caution in the solr documentation: *DOC
>> > >> > > > <https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#:~:text=By%20way%20of%20examples%3A%20nested,%2F%20colors)%20and%20supporting%20documentation%20(>*,
>> > >> > > > where it says: *Solr must internally reindex an entire nested
>> > >> > > > document tree if there are updates to it.*
>> > >> > > > Which means if a root or parent has 1000 child documents, even
>> > >> > > > with a change in a single document in any one of the fields, all
>> > >> > > > nested children are reindexed, which is not good enough.
>> > >> > > >
>> > >> > > > This made us rethink the performance gains that we will have if
>> > >> > > > nested documents are used in production.
>> > >> > > >
>> > >> > > > If that's the case, please let us know if there are any other
>> > >> > > > solutions which would help us in performance gains.
>> > >> > > >
>> > >> > > > *Note:*
>> > >> > > > We have already done POCs on external file fields and In-Place
>> > >> > > > updates, where we found they are not impactful for our project.
>> > >> > > >
>> > >> > > > *Thanks & Regards,*
>> > >> > > > *Uday Kumar*
>> > >> > >
>> > >> > > --
>> > >> > > Sincerely yours
>> > >> > > Mikhail Khludnev
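
For readers following the thread, here is a minimal Python sketch of the two request shapes it discusses: the collapse filter on supplier-id, and the fan-out of one supplier-level change into one update per product document. It only builds the query string and the JSON body and does not contact Solr. The function names, the sample query text and the new company name are illustrative assumptions, not taken from the messages; the field names and ids come from the example documents in the thread.

```python
import json
from urllib.parse import urlencode


def collapse_query(user_query, group_field="supplier-id"):
    """Build URL-encoded Solr params that keep one document per group_field.

    Mirrors the filter quoted in the thread: fq={!collapse field=supplier-id}
    """
    params = {"q": user_query, "fq": "{!collapse field=%s}" % group_field}
    return urlencode(params)


def supplier_change_updates(product_ids, field, value):
    """Build a JSON atomic-update body: one {"set": ...} entry per product.

    This illustrates Problem 1 from the thread: a single supplier-level
    change turns into as many document updates as the supplier has products.
    """
    docs = [{"product-id": pid, field: {"set": value}} for pid in product_ids]
    return json.dumps(docs)


if __name__ == "__main__":
    # Hypothetical user query text.
    print(collapse_query("jute bags"))
    # Both example documents of supplier 678; the new name is made up.
    print(supplier_change_updates([123, 863], "company-name", "BagFactory Ltd"))
```

Whether Solr applies such an update in place or rewrites the whole document internally depends on the schema. According to the partial-document-updates page linked in the thread, in-place updates apply only to single-valued, non-indexed, non-stored numeric docValues fields, so a text field such as company-name would presumably not qualify.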