Thanks, Jeremy. I will take a look at this. I am sure there must be such benchmarks, just as every product publishes its CPU and memory requirements. A product that depends entirely on the underlying network, especially a clustering-based one, should definitely publish its network requirements. I believe that a while back Coherence (a distributed cache) by Tangosol (now owned by Oracle) did provide some of those details.

Right now, as I explained in the thread, we have a cluster set up across 3 zones within a region, and we are seeing issues when one of the nodes sits in an AZ whose latency to the other 2 AZs is > 0.8 ms. So before we look around and research any further tuning options beyond what we have already tried, it would help if some of these network requirements were already published; that way we would know whether tuning further will help.
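For reference, this is the back-of-the-envelope arithmetic I understand your BDP suggestion below to imply. It is only a sketch: the 10 Gbps inter-AZ bandwidth and the 1 ms round-trip time are placeholder assumptions for illustration, not our measured numbers.

    // Rough bandwidth-delay product (BDP) estimate; all inputs are placeholders.
    public class BdpSketch {
        public static void main(String[] args) {
            long bandwidthBps = 10_000_000_000L; // assumed 10 Gbit/s inter-AZ link
            double rttSeconds = 0.001;           // assumed ~1 ms round-trip time

            // Bytes that can be "in flight" on the wire at full throughput.
            long bdpBytes = (long) (bandwidthBps / 8.0 * rttSeconds); // ~1.25 MB

            // Rule of thumb from the reply below: allocate double the BDP.
            long bufferBytes = 2 * bdpBytes; // ~2.5 MB

            System.out.printf("BDP = %d bytes, suggested buffer = %d bytes%n",
                bdpBytes, bufferBytes);
        }
    }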
On Thu, Mar 9, 2023 at 4:47 PM Jeremy McMillan <jeremy.mcmil...@gridgain.com> wrote:

> Has this kind of benchmark ever been published for any p2p cluster
> technology?
>
> What questions would it answer if there were such benchmarks for Ignite?
>
> Maybe this will help:
>
> There is an established algorithm for estimating the amount of buffer
> space necessary to keep a pipeline from stuttering during congestion. A
> generation ago this was a big deal, because most Linux distros shipped
> with a TCP buffer configuration that was insufficient for the rapidly
> improving network performance of Ethernet and broadband Internet service.
> The same idea generalizes to any streaming network communication, not
> only TCP.
>
> https://en.m.wikipedia.org/wiki/Bandwidth-delay_product
>
> Your infrastructure provider should be able to provide you with
> optimistic bandwidth numbers. Decide how much latency you need to
> tolerate. For best results, collect ping statistics over a long time to
> get realistic latency expectations. Plug that into the formula.
>
> To prevent buffer underruns and overruns, allocate buffer space for
> double the BDP, as a rule of thumb. For best results, instrument the
> buffers, collect statistics under various load scenarios, and adjust as
> necessary.
>
> This will only solve sporadic latency hiccups. Some of this traffic will
> affect lock contention, so dealing with poor network performance isn't
> just a buffering issue. Expect to find, investigate, and solve new issues
> after you get rid of the buffering exceptions.
>
> Good luck, and please let us know how things work out for you.
>
> On Thu, Mar 9, 2023, 17:08 Vicky <vicky...@gmail.com> wrote:
>
>> Thanks, Sumit. I've gone through these, but I don't see any mention of
>> the latency between two boxes within a cluster. Has any cloud-based
>> benchmarking been done? More specifically, for when a single cluster is
>> spread across multiple AZs within the same region.
>>
>> On Wed, Mar 8, 2023 at 10:33 PM Sumit Deshinge <sumit.deshi...@gmail.com>
>> wrote:
>>
>>> Please check whether these benchmark documents can help you:
>>> 1. Apache Ignite and Apache Cassandra benchmarks
>>> <https://www.gridgain.com/resources/blog/apacher-ignitetm-and-apacher-cassandratm-benchmarks-power-in-memory-computing>
>>> 2. GridGain benchmark results
>>> <https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results>
>>>
>>> You can also go through the performance tips available on the official
>>> site at:
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>
>>> On Wed, Mar 8, 2023 at 3:51 AM Vicky <vicky...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Is there any benchmarking of what an acceptable latency between nodes
>>>> is for an Ignite cluster to function stably?
>>>>
>>>> We currently have a single cluster across AZs (same region). The AZ
>>>> latency published by the cloud provider is ~0.4-1 ms.
>>>>
>>>> What we have observed is that for boxes where the AZ latency is
>>>> larger, i.e. > 0.8 ms, we start seeing server engine memory growing
>>>> exponentially.
>>>> We controlled that by setting the message queue and slow client
>>>> limits to 1024 and 1023, respectively. This helped get the memory in
>>>> check.
>>>>
>>>> However, now we are seeing client nodes failing with "Client node
>>>> outbound message queue size exceeded slowClientQueueLimit, the client
>>>> will be dropped (consider changing 'slowClientQueueLimit'
>>>> configuration property)".
>>>>
>>>> This results in continuous disconnects and reconnects happening on
>>>> these client nodes, and subsequently no processing going through.
>>>>
>>>> Is there any benchmarking done for Ignite, or documents available,
>>>> which say that for a stable Ignite cluster the latency between nodes
>>>> cannot be > x ms?
>>>>
>>>> However, if this is indeed an issue in our application, then I would
>>>> like to understand how to troubleshoot or get around it.
>>>>
>>>> Thanks
>>>> Victor
>>>
>>>
>>> --
>>> Regards,
>>> Sumit Deshinge
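PS, for anyone who finds this thread later: the two limits mentioned above are properties of TcpCommunicationSpi. Below is a minimal sketch of that configuration. The 1024/1023 values are the ones from this thread; the socket buffer sizes are illustrative assumptions sized near double the BDP estimate, not settings anyone here has validated.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class CommSpiLimitsSketch {
        public static void main(String[] args) {
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

            // Cap the per-connection outbound message queue, and drop client
            // nodes whose backlog exceeds the (slightly lower) client limit.
            commSpi.setMessageQueueLimit(1024);
            commSpi.setSlowClientQueueLimit(1023); // keep below messageQueueLimit

            // Illustrative socket buffers near 2x the BDP estimate (assumption):
            commSpi.setSocketSendBuffer(2 * 1280 * 1024);
            commSpi.setSocketReceiveBuffer(2 * 1280 * 1024);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCommunicationSpi(commSpi);

            try (Ignite ignored = Ignition.start(cfg)) {
                // The node runs with the limits above until stopped.
            }
        }
    }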