Hi,
*Please find details below:*

In our index, we have data of suppliers along with their products, which we display on the front end wrt search requests.

*Example: For a supplier with id 678, we have 2 products in our index*
*product-id (unique)*

*document1:*
{
  product-id: 123
  product-price: 2000rs
  product-name: Jute bags
  *supplier-id: 678*
  *company-name: BagFactoryLimited*
}

*document2:*
{
  product-id: 863
  product-price: 4500rs
  product-name: trolley bags
  *supplier-id: 678*
  *company-name: BagFactoryLimited*
}

As you can see from above, each document in our index contains product details, i.e. product-id, product-price, product-name, and also supplier details, i.e. supplier-id, company-name.

*Problem1: (while indexing)*
Here, whenever there is a change in a supplier-specific field, we re-index all the products of that supplier, although the supplier data is the same in all of their products.
*FYI:* We re-index ~5Cr documents per day.

*We would like to know if there is any better way to optimize this, which helps to avoid indexing of redundant data.*

*Problem2: (while querying)*
Now, when the data in our current index is queried, we display the single most relevant product of a supplier [even if the query matches 1 or more documents in our index].
For this we are using a collapse query on the supplier-id field (as we don't know the relationship between documents) [which is resource intensive].
*Ex:* fq={!collapse field=supplier-id}
*FYI:* We serve ~25 Lakh queries per day.

*We would like to know if there is any better way to organize the index, so that we can avoid such resource-intensive queries, thereby optimizing search response.*

*Our Solr Infra Stats: FYI*
*Version:* v9.6.1
*No. of nodes:* 8
*No. of shards:* 62
*Heap per node:* 12G
*RAM per node:* 50G
*No. of cpu cores per node:* 16
*Count of docs:* ~20Cr
*Size of Index:* ~250G
*Routing used:* implicit

Please let us know if you need any other details.

*Thanks & Regards,*
*Uday Kumar*


On Wed, Mar 5, 2025 at 12:08 PM Alessandro Benedetti <a.benede...@sease.io> wrote:

> Hi Uday,
> Your email is a perfect example of
> https://en.m.wikipedia.org/wiki/XY_problem.
>
> Both for indexing and query time you need to explain your problems and use
> cases rather than your attempted solutions.
>
> Then we'll be able to give some recommendations.
>
> On Wed, 5 Mar 2025, 06:39 Uday Kumar, <uday.p...@indiamart.com.invalid>
> wrote:
>
> > Hi,
> > I would like to give some extra context here, so that it would help in
> > getting better suggestions
> >
> > *Our goal: To improve our search system either by optimizing indexing or by
> > improving solr response times*
> >
> > *Current approach while indexing at our end:*
> > Even with a change in a single field of a document, we send the entire document
> > for indexing. (~2cr docs are being reindexed on a daily basis)
> > Solr version: v9.6.1
> >
> > *To Optimize Indexing:*
> > 1. POC on external file field: [which stores frequently changed fields in an
> >    external file and loads them after each commit, instead of indexing into solr
> >    for each change]
> >    https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html
> >    Observation:
> >    a. Works only with numeric fields
> >    b. Also the community suggested not to go with this, as it's an old feature.
> >    so, I dropped this.
> >
> > 2. POC on in-place update: (which helps in indexing only the fields which contain
> >    changes, but not the entire document)
> >    https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#in-place-updates
> >    Observation:
> >    a. Works only with single-valued fields
> >    b. Looks promising wrt indexing optimization but not suitable wrt our
> >    schema (as we have more multivalued fields).
> >    so, dropped
> >
> > Then we moved to alternatives which are expected to help in optimizing
> > response times
> >
> > *To improve Solr Response time:*
> > Nested Documents POC:
> > https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html
> >
> > *wrt this statement:*
> > "In terms of performance, *indexing the relationships between documents
> > usually yields much faster queries* than an equivalent "query time join",
> > since the relationships are already stored in the index and do not need to
> > be computed"
> >
> > But here we found that the complete block will be reindexed even with a change in
> > a single child document.
> > So, we would like to know more about this feature:
> > 1. Is this complete block reindexing heavy when compared with
> >    traditional indexing? [As we have more documents for reindexing per single
> >    day, i.e. ~2cr]
> > 2. What can we expect from this nested document feature in terms of
> >    performance (wrt tradeoff in indexing/querying)?
> > 3. If not, do we have any other alternative which we can work upon?
> >
> > *Thanks & Regards,*
> > *Uday Kumar*
> >
> > On Mon, Mar 3, 2025 at 7:17 PM Uday Kumar <uday.p...@indiamart.com> wrote:
> >
> > > Also in place updates happen on very specific conditions, have you checked
> > > you satisfy them before even attempting to see some sort of impact on your
> > > use case?
> > >
> > > Yes, we considered those specifications. Here, we didn't mean to say
> > > it's not impactful in itself, but with our project & schema.
> > >
> > > *Thanks & Regards,*
> > > *Uday Kumar*
> > > *Product Search Tech*
> > >
> > > On Fri, Feb 28, 2025 at 6:06 PM Alessandro Benedetti <
> > > benedetti.ale...@gmail.com> wrote:
> > >
> > >> What is your problem? Rather than asking about a solution you attempted, it is
> > >> usually better to start from the problem.
> > >>
> > >> You talk about grouping, have you considered field collapsing?
> > >>
> > >> According to my experience, going with nested documents rarely justifies the
> > >> performance and functional overhead both at indexing and query time.
> > >>
> > >> But sometimes you need them.
> > >>
> > >> Also in place updates happen on very specific conditions, have you checked
> > >> you satisfy them before even attempting to see some sort of impact on your
> > >> use case?
> > >>
> > >> Cheers
> > >>
> > >> On Fri, 28 Feb 2025, 08:30 Uday Kumar, <uday.p...@indiamart.com.invalid>
> > >> wrote:
> > >>
> > >> > Does this mean it will not be impactful in performance to use Nested
> > >> > Indexing in production with such an indexing rate?
> > >> >
> > >> > We have tried a POC on in-place updates and found it's not impactful either wrt
> > >> > our project, so we would not be using this in combination too
> > >> >
> > >> > *Thanks & Regards,*
> > >> > *Uday Kumar*
> > >> > *Product Search Tech*
> > >> >
> > >> > On Thu, Feb 27, 2025 at 12:31 PM Mikhail Khludnev <m...@apache.org> wrote:
> > >> >
> > >> > > Changing one child rewrites the whole block, period.
> > >> > > However, in-place updating child docValues is promising in theory, although
> > >> > > I don't know how it works in practice.
> > >> > >
> > >> > > On Thu, Feb 27, 2025 at 8:05 AM Uday Kumar <uday.p...@indiamart.com.invalid>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > > We are doing a POC on indexing nested documents in expectation of reducing
> > >> > > > grouping overhead at query time.
> > >> > > >
> > >> > > > On Prod Indexing, we are using the traditional approach of reindexing the
> > >> > > > entire document if there is any change in any of the fields.
> > >> > > > [we reindex ~2cr documents per day, FYI]
> > >> > > > Solr Version: v9.6.1
> > >> > > >
> > >> > > > But I have come across a caution in the Solr documentation: *DOC
> > >> > > > <https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html#:~:text=By%20way%20of%20examples%3A%20nested,%2F%20colors)%20and%20supporting%20documentation%20(>*,
> > >> > > > where it says: *Solr must internally reindex an entire nested document tree
> > >> > > > if there are updates to it.*
> > >> > > > Which means if a root or parent has 1000 child documents, even with a
> > >> > > > change in a single document in any one of the fields, all the nested children
> > >> > > > are reindexed, which is not good enough.
> > >> > > >
> > >> > > > This made us rethink the performance gains that we will have if nested
> > >> > > > documents are used in production.
> > >> > > >
> > >> > > > If that's the case, pls let us know if there are any other solutions which
> > >> > > > would help us in performance gains.
> > >> > > >
> > >> > > > *Note:*
> > >> > > > We have already done a POC on external file fields and in-place updates, where
> > >> > > > we found they are not impactful for our project.
> > >> > > >
> > >> > > > *Thanks & Regards,*
> > >> > > > *Uday Kumar*
> > >> > >
> > >> > > --
> > >> > > Sincerely yours
> > >> > > Mikhail Khludnev
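
For readers following the thread, the two request shapes it discusses can be sketched in Python. This is only a minimal sketch under assumptions, not what the posters actually run: the function names `collapse_query_params` and `atomic_update_payload` are hypothetical, the field names are taken from the example documents, and `product-id` is assumed to be the unique key. The first function builds the collapse filter quoted in the thread; the second builds a partial-update body using Solr's `set` modifier.

```python
import json
from urllib.parse import urlencode


def collapse_query_params(user_query, collapse_field="supplier-id", rows=10):
    """Build request parameters that keep one product per supplier (hypothetical helper)."""
    return {
        "q": user_query,
        # Same filter as quoted in the thread: one document per collapse_field value.
        "fq": "{!collapse field=%s}" % collapse_field,
        "rows": rows,
    }


def atomic_update_payload(product_id, changed_fields):
    """Build a JSON partial-update body that sends only the changed fields.

    Assumes "product-id" is the collection's unique key (an assumption,
    based on the example documents in the thread).
    """
    doc = {"product-id": product_id}
    for name, value in changed_fields.items():
        # "set" replaces the field's current value with the new one.
        doc[name] = {"set": value}
    return json.dumps([doc])


if __name__ == "__main__":
    params = collapse_query_params("jute bags")
    print("/select?" + urlencode(params))
    print(atomic_update_payload("123", {"company-name": "BagFactoryLimited"}))
```

Note that a partial update of this kind only reduces what the client sends. Unless the field meets the in-place update conditions described in the Solr guide page linked in the thread, Solr still rewrites the whole document internally, so this does not by itself remove the supplier-level re-indexing cost raised in Problem1.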