Thank you, Eric, for your reply. Please find my responses below.

> > 1. Were we right to assume to separate indexing and query layer? is it a
> > good idea? or something else could have been done better? because right
> > now it can affect our cluster stability, if in case replica node is not
> > available then queries will start going to indexing node, which is very
> > weak and it could choke the whole cluster.
>
> This is a good question.. since eventually your updates go from your
> leaders to all your replicas, I would start with just making all nodes the
> same size, and not try to have two layers. I think, as you learn more,
> maybe you could come up with more exotic layouts, but for now, just have
> every node the same size, and let every node do the work. The only reason
> to pull out your leaders is if they somehow do EXTRA work on indexing that
> other nodes wouldn’t do….
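For context, our query layer is pinned to pull replicas through the
shards.preference request parameter. A minimal sketch of such a request,
with hypothetical host and collection names:

  # ask Solr to prefer PULL replicas for this request (hypothetical names)
  curl "http://solr-query-1:8983/solr/products/select?q=*:*&shards.preference=replica.type:PULL"

As far as I understand, shards.preference only ranks the replicas that are
available, so when no pull replica is up for a shard the request falls back
to the NRT leader on the indexing node, which is exactly the stability
concern above.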
Our application involves a lot of daily updates to the data. We regularly
update approx. 40-50% of it (~50 million documents) and we index
continuously. Our replicas are of the PULL type, so our indexing nodes are
actually doing a lot of work that the other nodes are not. This was the
reason we decided to separate those layers: to maximize the performance of
the query layer.

> > 2. is there any guideline for the number of shards and shards size?
>
> If your queries are taking a long time due to the volume of data to query
> over, then shard. If you previously had a single leader/follower (the new
> terms for master/slave) and didn’t have performance issues, then I would
> have 1 shard, with 1 leader and 4 replicas. That would most closely
> replicate your previous setup.

We were actually facing performance issues. Though our response time on the
standalone setup was better than on the cloud, it lacked stability. We had
to optimize daily and replicate all the indexed data only once a day, so we
sacrificed real-time searching for stability, and some queries with larger
result sets took a long time to execute. Hence we decided to move to the
cloud and to try sharding. We tried many combinations of the number of
shards and found that 20-25 GB shards gave us the closest response time to
the earlier setup. (A rough sketch of the resulting collection layout is
below, after my replies.)

> > 3. How to decide the ideal number of CPUs to have per node? is there any
> > metric we can follow like load or CPU usage?
> > what should be the ideal CPU usages and load average based on the number
> > of CPU?
> > because our response time increases exponentially with the traffic. 250 ms
> > to 400 ms in peak hours. Peak hour traffic remains at 2000 requests per
> > minute. cpu usages at 55% and load average at ~6 (10 cpu)
>
> Lots of great stuff around Grafana and other tooling to get data, but I
> don’t have a specific answer.

Yes, we do have detailed dashboards for all the metrics, but I was looking
for some general guideline, e.g. "if your load or CPU usage stays above some
number, you probably need to increase the core count", or some other defined
equation. That would make it easier to fine-tune the system and to convince
the infra team to allocate more resources, because we have seen that when
load/CPU usage is lower in off-peak hours, response time is significantly
lower.

> > 5. how to handle dev and stage environments, should we have other smaller
> > clusters or any other approach?
>
> On dev and stage, since having more replicas is to support volume of
> queries, then I think you are okay with having just 1 leader and 1
> follower, or even just having the leader. Now, if you need to shard to
> support your query cause it takes a long time, then you can do that.

So we do need different small clusters (leader only, or with fewer
replicas). That requires either using our primary indexing process to index
data on all clusters, which might slow it down a little, or running a
separate indexing process for each cluster, which would lead to different
data in each cluster and different results during calibration. I was
wondering whether there is some way we could sync the data from the primary
cluster without hampering its performance. (There is a rough backup/restore
sketch below, after my replies.)

> > 7. How do you maintain versioning of config in zookeeper?
>
> I bootstrap my configset and use the api’s.
> https://github.com/querqy/chorus/blob/main/quickstart.sh#L119 for an
> example.

Thank you, I will check this. (A small sketch of what I have in mind is
below.)
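For reference, a simplified sketch of creating a collection with the
shard/replica mix described in my original mail below, through the
Collections API. Names are hypothetical, and it leaves out the node
placement we use to keep all NRT replicas on the dedicated indexing node
(e.g. via createNodeSet):

  # 8 shards, each with 1 NRT (indexing) and 1 PULL (query) replica; hypothetical names
  curl "http://solr-indexing-1:8983/solr/admin/collections?action=CREATE&name=products&collection.configName=products_conf&numShards=8&nrtReplicas=1&pullReplicas=1&maxShardsPerNode=8"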
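On syncing dev/stage from the primary cluster, one option I want to try is
the Collections API backup/restore, so the smaller clusters get the same
data without running a second indexing pipeline. A rough sketch, with
hypothetical collection, host and path names (the location has to be a
shared filesystem or backup repository that both clusters can reach):

  # on the production cluster: snapshot the collection to a shared location
  curl "http://solr-prod-1:8983/solr/admin/collections?action=BACKUP&name=products_snap&collection=products&location=/mnt/solr-backups"

  # on the dev/stage cluster: restore the snapshot into a (smaller) collection
  curl "http://solr-stage-1:8983/solr/admin/collections?action=RESTORE&name=products_snap&collection=products&location=/mnt/solr-backups"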
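On the config versioning point, what I had in mind after seeing the
quickstart.sh example is to keep the configset in git and push it to
ZooKeeper from CI with the bundled tool. A minimal sketch, assuming a
hypothetical configset name, directory and ZooKeeper address:

  # upload the configset tracked in git to ZooKeeper (hypothetical names)
  bin/solr zk upconfig -n products_conf -d ./configsets/products/conf -z zk1:2181,zk2:2181,zk3:2181

  # reload the collection so the new config is picked up
  curl "http://solr-prod-1:8983/solr/admin/collections?action=RELOAD&name=products"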
On Thu, Sep 8, 2022 at 11:43 PM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> Lots of good questions here, I’ll inline a couple of answers...
>
> > On Sep 8, 2022, at 1:59 AM, Satya Nand <satya.n...@indiamart.com.INVALID>
> > wrote:
> >
> > Hi All,
> >
> > We have recently moved from solr 6.5 to solr cloud 8.10.
> >
> > *Earlier Architecture:* We were using a master-slave architecture where we
> > had 4 slaves (14 cpu, 96 GB ram, 20 GB Heap, 110 GB index size). We used to
> > optimize and replicate nightly.
> >
> > *Now.*
> > We didn't have a clear direction on the number of shards. So we did some
> > POC with variable numbers of shards. We found that with 8 shards we were
> > close to the response time we were getting earlier without using too much
> > infrastructure.
> > Based on our queries we couldn't find a routing parameter so now all
> > queries are being broadcasted to every shard.
> >
> > Now, we have 8+1 solr nodes cluster. Where 1 Indexing node contains all (8)
> > NRT Primary shards. This is where all indexing happens. Then We have
> > another 8 nodes each having (10 cpu, 42 GB ram, 8 GB heap, ~23 GB Index)
> > consisting of one pull replica of each primary shard. For querying, we have
> > used *shard.preference as PULL* so that all queries are returned from pull
> > replicas.
> >
> > Our thought process was that we should have the indexing layer and query
> > layer separate so one does not affect the other.
> >
> > we made it live this week. Though it didn't help in reducing the response
> > time, in fact, we found an increase in average response time. We found a
> > substantial impact on response time after 85 percentile response time, So
> > timeouts reduced significantly.
> >
> > *Now I have a few questions for all the guys who are using solr cloud to
> > help me understand and increase the stability of my cluster.*
> >
> > 1. Were we right to assume to separate indexing and query layer? is it a
> > good idea? or something else could have been done better? because right
> > now it can affect our cluster stability, if in case replica node is not
> > available then queries will start going to indexing node, which is very
> > weak and it could choke the whole cluster.
>
> This is a good question.. since eventually your updates go from your
> leaders to all your replicas, I would start with just making all nodes the
> same size, and not try to have two layers. I think, as you learn more,
> maybe you could come up with more exotic layouts, but for now, just have
> every node the same size, and let every node do the work. The only reason
> to pull out your leaders is if they somehow do EXTRA work on indexing that
> other nodes wouldn’t do….
>
> > 2. is there any guideline for the number of shards and shards size?
>
> If your queries are taking a long time due to the volume of data to query
> over, then shard. If you previously had a single leader/follower (the new
> terms for master/slave) and didn’t have performance issues, then I would
> have 1 shard, with 1 leader and 4 replicas. That would most closely
> replicate your previous setup.
>
> > 3. How to decide the ideal number of CPUs to have per node? is there any
> > metric we can follow like load or CPU usage?
> > what should be the ideal CPU usages and load average based on the number
> > of CPU?
> > because our response time increases exponentially with the traffic. 250 ms
> > to 400 ms in peak hours. Peak hour traffic remains at 2000 requests per
> > minute. cpu usages at 55% and load average at ~6 (10 cpu)
>
> Lots of great stuff around Grafana and other tooling to get data, but I
> don’t have a specific answer.
>
> > 4. How do decide the number of nodes based on shards or any other metric?
> > should one increase nodes or CPUs on existing nodes?
> >
> > 5. how to handle dev and stage environments, should we have other smaller
> > clusters or any other approach?
>
> On dev and stage, since having more replicas is to support volume of
> queries, then I think you are okay with having just 1 leader and 1
> follower, or even just having the leader. Now, if you need to shard to
> support your query cause it takes a long time, then you can do that.
>
> > 6. Did your infrastructure requirement also increase compared to standalone
> > when moving to the cloud, if yes then how much?
> >
> > 7. How do you maintain versioning of config in zookeeper?
>
> I bootstrap my configset and use the api’s.
> https://github.com/querqy/chorus/blob/main/quickstart.sh#L119 for an
> example.
>
> > 8. any performance issue you faced or any other recommendation?
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com | My Free/Busy
> <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.