This is sufficiently atypical that many people aren't going to have enough
intuition to figure it out without seeing your metrics / logs / debugging
data (e.g. heap dumps).

My only guess, and it's a pretty big guess, is that your write timeout is
low enough (or network quality bad enough, though that's unlikely with GCP,
they're usually very good at networking) that your coordinator is timing
out waiting for India to ack the write, which causes it to write a hint.
The delivery of that hint also times out (due to network latency or just
the speed of light), so you're creating a death spiral: a write is slow, so
it hints -> hint replay now means you're doing extra work, so it's even
slower -> so new writes also hint, and maybe hints re-deliver.
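
If that theory fits, the relevant knobs live in cassandra.yaml. A minimal
sketch of the settings I'd check first (the values shown are the usual
defaults, not recommendations; verify against your Cassandra version):

```yaml
# How long the coordinator waits for a replica ack before giving up
# and writing a hint instead (default 2000 ms).
write_request_timeout_in_ms: 2000

# How long hints accumulate for an unreachable/slow node before
# being dropped (default 3 hours).
max_hint_window_in_ms: 10800000

# Throttle on hint replay; unthrottled replay piles extra load onto
# a node that is already behind.
hinted_handoff_throttle_in_kb: 1024
```

If the cross-DC round trip plus replica write time regularly approaches
the timeout, raising it (or fixing the latency) breaks the spiral at its
source.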

The other option would be that system_traces is only replicated in India,
and someone enabled tracing (either at the application level or
probabilistically).
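
Both halves of that theory are cheap to check from any node (this assumes
nodetool and cqlsh are on the PATH; they're standard commands, but run
them against your own cluster):

```shell
# Probabilistic tracing: anything above 0.0 means a fraction of all
# requests is being traced and written into system_traces.
nodetool gettraceprobability

# Where system_traces is actually replicated. If its replication maps
# only (or mostly) onto the India DC, tracing load lands there.
cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'system_traces';"
```

`nodetool settraceprobability 0` turns probabilistic tracing back off if
someone left it on.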

If it's neither of those, you're going to have to debug it for real. Do
metrics show hints? Do you have table-level metrics for writes per table?
Are they what you expect? Are they higher in India? Is one table only in
India and getting lots of writes? Take a heap dump of one of the India
machines and look to see what the mutations are. Is it a table you
recognize? Is the replication factor set the way you expect?
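
To answer the metrics questions quickly without Grafana, nodetool output
is easy to scrape. A hedged sketch (the column layout matches typical
`nodetool tpstats` output, but verify against your version; the inlined
sample text stands in for a live node here):

```shell
# Extract the MutationStage pending count from tpstats-style output.
# In production pipe the real command in:
#   pending=$(nodetool tpstats | awk '$1 == "MutationStage" { print $3 }')
# The sample below exists only to make this sketch self-contained.
tpstats_sample='Pool Name                    Active   Pending      Completed
MutationStage                    32   1000000        8765309
ReadStage                         0         0         123456'

# Column 3 of the MutationStage row is the pending count.
pending=$(printf '%s\n' "$tpstats_sample" | awk '$1 == "MutationStage" { print $3 }')
echo "MutationStage pending: $pending"
```

A pending count anywhere near 1 million while client writes are about
300/sec/node is strong evidence the extra mutations are internal (hints
or tracing), not application traffic.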


On Tue, Jul 20, 2021 at 10:20 AM MyWorld <timeplus.1...@gmail.com> wrote:

> Kindly help in this regard. What could be the possible reason for the
> load and mutation spike in the India data center?
>
> On 2021/07/20 00:14:56 MyWorld wrote:
> > Hi Arvinder,
> > It's a separate cluster. Here max partition size is 32mb.
> >
> > On 2021/07/19 23:57:27 Arvinder Dhillon wrote:
> > > Is this the same cluster with 1G partition size?
> > >
> > > -Arvinder
> > >
> > > On Mon, Jul 19, 2021, 4:51 PM MyWorld <ti...@gmail.com> wrote:
> > >
> > > > Hi daemeon,
> > > > We have already tuned the TCP settings to improve the bandwidth.
> > > > Earlier we had a lot of hint and mutation message drops, which were
> > > > gone after tuning TCP. Moreover, we are writing with CL LOCAL_QUORUM
> > > > on the US side, so the ack is taken from the local DC.
> > > > I am still concerned about what could be the reason for the
> > > > increased mutation count.
> > > >
> > > > On 2021/07/19 19:55:52 daemeon reiydelle wrote:
> > > > > You may want to think about the latency impacts of a cluster
> > > > > that has one node "far away". This is such a basic design flaw
> > > > > that you need to do some basic learning, and some basic
> > > > > understanding of networking and latency.
> > > > >
> > > > > On Mon, Jul 19, 2021 at 10:38 AM MyWorld <ti...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Currently we have a cluster with 2 DCs of 3 nodes each. One DC
> > > > > > is in GCP-US while the other is in GCP-India. Just to add here,
> > > > > > the configuration of every node across both DCs is the same:
> > > > > > CPU-6, RAM-32 GB, Heap-8 GB.
> > > > > >
> > > > > > We do all our writes on the US data center. While performing a
> > > > > > bulk write on GCP-US, we observe a normal load of 1 on US while
> > > > > > the load at GCP-India spikes to 10.
> > > > > >
> > > > > > On observing tpstats further in Grafana, we found the mutation
> > > > > > stage at GCP-India intermittently goes to 1 million, though our
> > > > > > overall write rate is nearly 300 per sec per node. We don't know
> > > > > > the reason, but whenever we have this spike, we have a load
> > > > > > issue.
> > > > > > Please help: what could be the possible reason for this?
> > > > > >
> > > > > > Regards,
> > > > > > Ashish
> > > > > >
