Re: Network Bandwidth and Multi-DC replication

Jens Fischer Tue, 15 Dec 2020 08:39:39 -0800

Hi Scott,

Thank you for your help. There was an error or at least an ambiguity in my 
second Mail! I wrote:


I still see outgoing cross-DC traffic of ~ 2x the “write size A”

What I wanted to say was:  I still see outgoing cross-DC traffic of ~ 2x the 
“write size A” per remote DC or 4x the "write size A” in total.

Your response underlines that this is way more than expected. Any idea what 
could cause this or how to further debug? As mentioned I already checked for 
anti-entropy repair (not running) and read repairs (read_repair_chance and 
dclocal_read_repair_chance set to 0) and hints (no hints replay according to 
logs, hint directory empty).

Best
Jens

On 10. Dec 2020, at 01:28, Scott Hirleman 
<scott.hirle...@gmail.com<mailto:scott.hirle...@gmail.com>> wrote:

2x makes sense tho. If you have 3 DCs, you write locally to DC1 and then it 
gets replicated once in DC1 and then it gets replicated to DC2 AND DC3 at 
consistency local_one via cross DCtraffic to one of the nodes in each DC, then 
replicated in each DC to a second node via local traffic

Write comes in to DC1, node 1; it replicates to 1) DC1 node 2, DC2 node 1, and 
DC3 node 1. So the outgoing traffic is 2x the write size by going to each DC2 
and DC3. Once it gets written to DC2 node 1, it gets replicated locally to DC2 
node 2; same for DC3 re DC3 node 2.

On Wed, Dec 2, 2020 at 9:36 AM Jens Fischer 
<j.fisc...@sonnen.de<mailto:j.fisc...@sonnen.de>> wrote:
Hi,

I checked for all the given other factors - anti entropy repair, hints, read 
repair - and I still see outgoing cross-DC traffic of ~ 2x the “write size A” 
(as defined below). Given Jeffs answers this is not to be expected, i.e. there 
is something wrong here. Does anybody have an idea how to debug?

I define the "write size A”  as follows: Take the incoming traffic from all 
nodes inserting into DC1 and sum it up.

Best
Jens

On 30. Nov 2020, at 12:00, Jens Fischer 
<j.fisc...@sonnen.de<mailto:j.fisc...@sonnen.de>> wrote:

Hi Jeff,

Thank you for your answer, very helpful already!

All writes are done with `LOCAL_ONE` and we have RF=2 in each data center.

To compare our examples we need to come to an agreement on what you are calling 
“write size A”. I gave two different write sizes:

I call the bandwidth for receiving the the data on Node A "base bandwidth”

This is the inbound traffic at Node A. Data to Node A is transmitted as 
Protobuf inside VPN tunnels. A very rough estimate of data size, I know. Node A 
is not a Cassandra node!


Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth

I looked at all the Cassandra nodes in DC1 and the traffic coming from Node A. 
I then summed up this traffic.
@Jeff: I assume this is closer to what you call “write size A”?

Best
Jens


On 26. Nov 2020, at 17:12, Jeff Jirsa 
<jji...@gmail.com<mailto:jji...@gmail.com>> wrote:



On Nov 26, 2020, at 9:53 AM, Jens Fischer 
<j.fisc...@sonnen.de<mailto:j.fisc...@sonnen.de>> wrote:

 Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the Cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and two the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level:

* I call the bandwidth for receiving the the data on Node A "base bandwidth"
* Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
* Replication to each of the other data centres takes 5 times the base bandwidth
* overall we see a “bandwidth amplification” of ~ 13x (3+5+5)


You didn’t specify consistency levels or replication factors so it’s hard to 
check your math.

Here’s what I’d expect

If you do RF=3 per DC and have 3 DCs, a write of size A is written to the 
cluster using coordinator C

C sends that write to replicas R1, R2, and R3 in the local DC
C sends the write to F2 and F3 - forwarders - one in each remote DC
F2 sends the write to R1-2, R2-2 in the remote DC2 and itself (F2 will be a 
replica), each replica sends an ack back to C
F3 sends the write to R1-3, R2-3 in the remote DC3 and itself (F3 will be a 
replica), each replica sends an ack back to C

You can avoid one extra write using token aware routing and making C a replica 
(R1, for example)

Given this, I don’t see how a remote DC is 5x A - it should be cross DC/WAN 
cost A into the forwarder and 2A out of the forwarder (local traffic , 
cross-AZ/rack but not WAN), with trivial ACK cost to the original DC.

If you’re seeing more than this, it may be something other than pure writes - 
anti entropy repair, hints, read repair are all possible, and would have 
different causes and fixes.

Most people who get to this level of calculation are doing so because they’re 
trying to solve a problem, and the common problem is that cross-AZ traffic in 
cloud providers is expensive at scale. If that’s why you’re asking, compression 
is your obvious win, and reducing RF is your alternative option (3/3/3 is super 
expensive - how many dcs take writes directly and which consistency level are 
you using? What’s the point of having 9 copies of the data? Would 1 copy per dc 
be enough if you’re doing global quorum? Would 2 copies in the cold DCs be 
enough if you’re only reading / writing from one DC?).

My questions:

1. Would you considers these factors expected behaviour?

13 seems high. 9 seems more correct unless you’re double counting sending and 
receiving.

2. Are there ways to reduce the traffic through configuration?

Compression, reducing RF, maybe mitigation with longer timeouts to avoid double 
sending hints.


A few additional notes on the setup:

* use NetworkTopologyStrategy for replication and cassandra-rackdc.properties 
to configure the GossipingPropertyFileSnitch
* internode_compression is set to dc
* inter_dc_tcp_nodelay is set to false

Any help is highly appreciated!

Best Regards
Jens

Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph 
Ostermann, Hermann Schweizer, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph 
Ostermann, Hermann Schweizer, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph 
Ostermann, Hermann Schweizer, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


--
Scott Hirleman
scott.hirle...@gmail.com<mailto:scott.hirle...@gmail.com>


Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph 
Ostermann, Hermann Schweizer, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908

Re: Network Bandwidth and Multi-DC replication

Reply via email to