Thanks, Jeremy. I will take a look at this. I am sure there must be such benchmarks, just as every product publishes its CPU and memory requirements. A product that depends entirely on the underlying network, especially a clustering-based one, should definitely publish its network requirements. I believe that a while back Coherence (a distributed cache) by Tangosol (now owned by Oracle) did provide some of those details.

Right now, as I explained in the thread, we have a cluster set up across 3 zones within a region, and we are seeing issues when one of the nodes sits in an AZ whose latency to the other 2 AZs is > 0.8 ms. So before we look around and research any further tuning options beyond what we have already tried, it would help if some of these network requirements were already published; that way we would know whether tuning further will help.
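For reference, this is the back-of-the-envelope arithmetic I understand your BDP suggestion below to imply. It is only a sketch: the 10 Gbps inter-AZ bandwidth and the 1 ms round-trip time are placeholder assumptions for illustration, not our measured numbers.

    // Rough bandwidth-delay product (BDP) estimate; all inputs are placeholders.
    public class BdpSketch {
        public static void main(String[] args) {
            long bandwidthBps = 10_000_000_000L; // assumed 10 Gbit/s inter-AZ link
            double rttSeconds = 0.001;           // assumed ~1 ms round-trip time

            // Bytes that can be "in flight" on the wire at full throughput.
            long bdpBytes = (long) (bandwidthBps / 8.0 * rttSeconds); // ~1.25 MB

            // Rule of thumb from the reply below: allocate double the BDP.
            long bufferBytes = 2 * bdpBytes; // ~2.5 MB

            System.out.printf("BDP = %d bytes, suggested buffer = %d bytes%n",
                bdpBytes, bufferBytes);
        }
    }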
On Thu, Mar 9, 2023 at 4:47 PM Jeremy McMillan <jeremy.mcmil...@gridgain.com> wrote:

> Has this kind of benchmark ever been published for any p2p cluster
> technology?
>
> What questions would it answer if there were such benchmarks for Ignite?
>
> Maybe this will help:
>
> There is an established algorithm for estimating the amount of buffer
> space necessary to keep a pipeline from stuttering during congestion. A
> generation ago this was a big deal, because most Linux distros shipped
> with a TCP buffer configuration that was insufficient for the rapidly
> improving network performance of Ethernet and broadband Internet service.
> The same idea generalizes to any streaming network communication, not
> only TCP.
>
> https://en.m.wikipedia.org/wiki/Bandwidth-delay_product
>
> Your infrastructure provider should be able to provide you with
> optimistic bandwidth numbers. Decide how much latency you need to
> tolerate. For best results, collect ping statistics over a long time to
> get realistic latency expectations. Plug that into the formula.
>
> To prevent buffer underruns and overruns, allocate buffer space for
> double the BDP, as a rule of thumb. For best results, instrument the
> buffers, collect statistics under various load scenarios, and adjust as
> necessary.
>
> This will only solve sporadic latency hiccups. Some of this traffic will
> affect lock contention, so dealing with poor network performance isn't
> just a buffering issue. Expect to find, investigate, and solve new issues
> after you get rid of the buffering exceptions.
>
> Good luck, and please let us know how things work out for you.
>
> On Thu, Mar 9, 2023, 17:08 Vicky <vicky...@gmail.com> wrote:
>
>> Thanks, Sumit. I've gone through these, but I don't see any mention of
>> the latency between two boxes within a cluster. Has any cloud-based
>> benchmarking been done? More specifically, for when a single cluster is
>> spread across multiple AZs within the same region.
>>
>> On Wed, Mar 8, 2023 at 10:33 PM Sumit Deshinge <sumit.deshi...@gmail.com>
>> wrote:
>>
>>> Please check whether these benchmark documents can help you:
>>> 1. Apache Ignite and Apache Cassandra benchmarks
>>> <https://www.gridgain.com/resources/blog/apacher-ignitetm-and-apacher-cassandratm-benchmarks-power-in-memory-computing>
>>> 2. GridGain benchmark results
>>> <https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results>
>>>
>>> You can also go through the performance tips available on the official
>>> site at:
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>
>>> On Wed, Mar 8, 2023 at 3:51 AM Vicky <vicky...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Is there any benchmarking of what an acceptable latency between nodes
>>>> is for an Ignite cluster to function stably?
>>>>
>>>> We currently have a single cluster across AZs (same region). The AZ
>>>> latency published by the cloud provider is ~0.4-1 ms.
>>>>
>>>> What we have observed is that for boxes where the AZ latency is
>>>> larger, i.e. > 0.8 ms, we start seeing server engine memory growing
>>>> exponentially.
>>>> We controlled that by setting the message queue and slow client
>>>> limits to 1024 and 1023, respectively. This helped get the memory in
>>>> check.
>>>>
>>>> However, now we are seeing client nodes failing with "Client node
>>>> outbound message queue size exceeded slowClientQueueLimit, the client
>>>> will be dropped (consider changing 'slowClientQueueLimit'
>>>> configuration property)".
>>>>
>>>> This results in continuous disconnects and reconnects happening on
>>>> these client nodes, and subsequently no processing going through.
>>>>
>>>> Is there any benchmarking done for Ignite, or documents available,
>>>> which say that for a stable Ignite cluster the latency between nodes
>>>> cannot be > x ms?
>>>>
>>>> However, if this is indeed an issue in our application, then I would
>>>> like to understand how to troubleshoot or get around it.
>>>>
>>>> Thanks
>>>> Victor
>>>
>>>
>>> --
>>> Regards,
>>> Sumit Deshinge
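PS, for anyone who finds this thread later: the two limits mentioned above are properties of TcpCommunicationSpi. Below is a minimal sketch of that configuration. The 1024/1023 values are the ones from this thread; the socket buffer sizes are illustrative assumptions sized near double the BDP estimate, not settings anyone here has validated.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class CommSpiLimitsSketch {
        public static void main(String[] args) {
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

            // Cap the per-connection outbound message queue, and drop client
            // nodes whose backlog exceeds the (slightly lower) client limit.
            commSpi.setMessageQueueLimit(1024);
            commSpi.setSlowClientQueueLimit(1023); // keep below messageQueueLimit

            // Illustrative socket buffers near 2x the BDP estimate (assumption):
            commSpi.setSocketSendBuffer(2 * 1280 * 1024);
            commSpi.setSocketReceiveBuffer(2 * 1280 * 1024);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCommunicationSpi(commSpi);

            try (Ignite ignored = Ignition.start(cfg)) {
                // The node runs with the limits above until stopped.
            }
        }
    }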