Hi,

For Cassandra CDC feature:
http://cassandra.apache.org/doc/latest/operating/cdc.html

The CDC data is duplicated RF times: if the replication factor is 3 in a
DC, the same mutation is written to the CDC log 3 times, once per replica.
One solution is to add another DC with RF=1 that is used only for CDC,
so there is no duplicated data. But it's very costly to run a DC just for
that job, and if any node in that DC goes down, there will be update lag.

Our pipeline pushes the CDC data to Kafka and then ingests it into Hive.
When ingesting to Hive, we could hold a 20-minute window of data in
memory and de-dup there, but Kafka would still store 3x the data, which
is also costly.
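
A minimal sketch of the 20-minute in-memory de-dup we have in mind is
below (assuming each CDC record carries a unique key we can de-dup on,
e.g. partition key plus mutation timestamp; the names and the Kafka topic
in the usage comment are made up for illustration):

    import time

    WINDOW_SECONDS = 20 * 60  # keep 20 minutes of keys in memory

    class Deduper:
        def __init__(self, window_seconds=WINDOW_SECONDS):
            self.window_seconds = window_seconds
            self.seen = {}  # record key -> time first seen

        def is_duplicate(self, key, now=None):
            now = time.time() if now is None else now
            self._evict(now)
            if key in self.seen:
                return True          # a later copy of the RF replicas
            self.seen[key] = now     # first copy wins
            return False

        def _evict(self, now):
            # Drop keys older than the window so memory stays bounded.
            cutoff = now - self.window_seconds
            for k in [k for k, ts in self.seen.items() if ts < cutoff]:
                del self.seen[k]

    # Usage with a kafka-python consumer loop, keeping only the first of
    # the RF copies before writing to Hive:
    #
    #   from kafka import KafkaConsumer
    #   dedup = Deduper()
    #   consumer = KafkaConsumer("cdc-topic", bootstrap_servers="broker:9092")
    #   for msg in consumer:
    #       if not dedup.is_duplicate(msg.key):
    #           write_to_hive(msg.value)   # hypothetical sink function

This only collapses duplicates that arrive within the same 20-minute
window, and it doesn't help with the 3x storage on the Kafka side.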

Has anyone run into a similar problem? What's your solution? Any
feedback is welcome.

Thanks,
Jay
