Firstly, Ignite isn’t a product in the same way that Coherence is; it’s a community-driven project. If you’d like commercial support, there are options available.
Deploying Ignite across availability zones is common, and single-digit millisecond round-trip latency between zones is on the high side of reasonable. (Though the definitions of “region” and “zone” do vary between providers.) So I don’t think there’s anything fundamentally wrong with what you’re doing. The question is: how is your installation different from others’? You haven’t shared any logs or configuration, so all we can really do is guess. The queues building up suggest you have a throughput problem. Could it be a network bandwidth or reliability issue?

> On 11 Mar 2023, at 00:01, Vicky <vicky...@gmail.com> wrote:
>
> Thanks, Jeremy, I will take a look at this.
>
> I am sure there must be such benchmarks, just as every product publishes its CPU and memory requirements. A product that depends entirely on the underlying network, especially a clustering-based one, should definitely publish its network requirements. I believe that some time back Coherence (a distributed cache) by Tangosol (now owned by Oracle) did provide some of those details.
>
> Right now, as I explained in the thread, we have a cluster set up across 3 zones within a region, and we are seeing issues when one of the nodes is set up in an AZ whose latency is > 0.8 ms from the other 2 AZs. So before we look around and research any further tuning options beyond what we have already tried, it would help if some of these network requirements were already published; that way we would know whether tuning further would help.
>
> On Thu, Mar 9, 2023 at 4:47 PM Jeremy McMillan <jeremy.mcmil...@gridgain.com <mailto:jeremy.mcmil...@gridgain.com>> wrote:
>> Has this kind of benchmark ever been published for any p2p cluster technology?
>>
>> What questions would it answer if there were such benchmarks for Ignite?
>>
>> Maybe this will help:
>>
>> There is an established algorithm for estimating the amount of buffer space necessary to keep a pipeline from stuttering during congestion. A generation ago this was a big deal, because most Linux distros shipped with TCP buffer configuration that was insufficient for the rapidly improving network performance of Ethernet and broadband Internet service. The same idea generalizes to any streaming network communication, not only TCP.
>>
>> https://en.m.wikipedia.org/wiki/Bandwidth-delay_product
>>
>> Your infrastructure provider should be able to give you optimistic bandwidth numbers. Decide how much latency you need to tolerate. For best results, collect ping statistics over a long period to get realistic latency expectations. Plug those into the formula.
>>
>> To prevent buffer underruns and overruns, allocate buffer space for double the BDP, as a rule of thumb. For best results, instrument the buffers, collect statistics under various load scenarios, and adjust as necessary.
>>
>> This will only solve sporadic latency hiccups. Some of this traffic will affect lock contention, so dealing with poor network performance isn't just a buffering issue. Expect to find, investigate, and solve new issues after you get rid of the buffering exceptions.
>>
>> Good luck, and please let us know how things work for you.
>>
>> On Thu, Mar 9, 2023, 17:08 Vicky <vicky...@gmail.com <mailto:vicky...@gmail.com>> wrote:
>>> Thanks, Sumit. I've gone through these, but I don't see any mention of latency between two boxes within a cluster. Has any cloud-based benchmarking been done? More specifically, for the case where a single cluster is spread across multiple AZs within the same region.
>>>
>>> On Wed, Mar 8, 2023 at 10:33 PM Sumit Deshinge <sumit.deshi...@gmail.com <mailto:sumit.deshi...@gmail.com>> wrote:
>>>> Please check whether these benchmark documents can help you:
>>>> 1. Apache Ignite and Apache Cassandra benchmarks <https://www.gridgain.com/resources/blog/apacher-ignitetm-and-apacher-cassandratm-benchmarks-power-in-memory-computing>
>>>> 2. GridGain benchmark results <https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results>
>>>>
>>>> You can also go through the performance tips on the official site: https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>>
>>>> On Wed, Mar 8, 2023 at 3:51 AM Vicky <vicky...@gmail.com <mailto:vicky...@gmail.com>> wrote:
>>>>> Hi,
>>>>> Is there any benchmarking of what latency between nodes is acceptable for an Ignite cluster to function stably?
>>>>>
>>>>> We currently have a single cluster across AZs (same region). The inter-AZ latency published by the cloud provider is ~0.4-1 ms.
>>>>>
>>>>> What we have observed is that for boxes where the AZ latency is larger, i.e. > 0.8 ms, we start seeing server engine memory growing exponentially. We controlled that by setting the message queue and slow client limits to 1024 and 1023 respectively. This helped keep the memory in check.
>>>>>
>>>>> However, now we are seeing client nodes failing with "Client node outbound message queue size exceeded slowClientQueueLimit, the client will be dropped (consider changing 'slowClientQueueLimit' configuration property)".
>>>>>
>>>>> This results in continuous disconnects and reconnects on these client nodes, and subsequently no processing goes through.
>>>>>
>>>>> Is there any benchmarking done for Ignite, or any document available, which says that for a stable Ignite cluster the latency between nodes cannot be > x ms?
>>>>>
>>>>> However, if this is indeed an issue with our application, then I would like to understand how to troubleshoot or work around it.
>>>>>
>>>>> Thanks
>>>>> Victor
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Sumit Deshinge
>>>>
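To make the bandwidth-delay-product suggestion from the quoted thread concrete, here is a minimal sketch of the arithmetic. The bandwidth and RTT figures are illustrative assumptions (a 10 Gbit/s inter-AZ link and the 0.8 ms latency mentioned above), not measured values — substitute your provider's numbers and your own ping statistics.

```java
public class BdpBuffer {
    /**
     * Bandwidth-delay product: the number of bytes "in flight" on the link.
     * bandwidth is in bits per second, rtt in seconds.
     */
    static long bdpBytes(double bandwidthBitsPerSec, double rttSeconds) {
        return Math.round(bandwidthBitsPerSec * rttSeconds / 8.0);
    }

    public static void main(String[] args) {
        double bandwidth = 10e9;   // assumed: 10 Gbit/s inter-AZ link
        double rtt = 0.0008;       // assumed: 0.8 ms round trip, from the thread
        long bdp = bdpBytes(bandwidth, rtt);   // 1_000_000 bytes in flight
        long buffer = 2 * bdp;                 // rule of thumb: double the BDP
        System.out.println("BDP = " + bdp + " B, suggested buffer = " + buffer + " B");
    }
}
```

With these assumed figures, a 0.8 ms RTT at 10 Gbit/s puts about 1 MB in flight, so the rule of thumb suggests roughly 2 MB of buffer per connection; longer-tail latency spikes would push that number up.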
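For reference, the "msg queue and slow client limits" Victor describes map onto `TcpCommunicationSpi` settings. A configuration sketch follows, using the 1024/1023 values from the thread; the socket buffer sizes are an assumption derived from the BDP rule of thumb, not a recommendation, and should be checked against your own measurements.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class CommSpiConfig {
    static IgniteConfiguration withQueueLimits() {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Back-pressure limit on per-connection outbound message queues
        // (value from the thread).
        commSpi.setMessageQueueLimit(1024);

        // Clients whose outbound queue exceeds this are dropped; it should
        // be set below messageQueueLimit (value from the thread).
        commSpi.setSlowClientQueueLimit(1023);

        // Socket buffers sized at ~2x an assumed 1 MB bandwidth-delay
        // product; illustrative only.
        commSpi.setSocketSendBuffer(2 * 1024 * 1024);
        commSpi.setSocketReceiveBuffer(2 * 1024 * 1024);

        return new IgniteConfiguration().setCommunicationSpi(commSpi);
    }
}
```

Note the trade-off the thread already ran into: tighter queue limits cap server-side memory growth, but they convert sustained throughput shortfalls into slow-client disconnects rather than fixing the underlying bandwidth or latency problem.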