Hi,

Regarding the Cassandra CDC feature: http://cassandra.apache.org/doc/latest/operating/cdc.html
The CDC data is duplicated RF times. Say the replication factor is 3 in one DC; the same data will be sent out 3 times. One solution is to add another DC with RF=1 that is used only for CDC, so it carries no duplicated data. But a DC dedicated to that one job is very costly, and if any node in that DC goes down, there will be update lag.

Our pipeline pushes the data to Kafka and then ingests it into Hive. When ingesting into Hive, we could hold 20 minutes of data in memory and de-dup there. But Kafka would still store 3X the data, which is also costly.

Does anyone have a similar problem? What's your solution? Any feedback is welcome.

Thanks,
Jay
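For what it's worth, the 20-minute in-memory de-dup could look roughly like the sketch below. The `WindowedDeduper` class, the `window_seconds` parameter, and the idea of an `event_id` (e.g. partition key plus writetime) are my assumptions for illustration, not anything Cassandra CDC provides:

```python
import time
from collections import OrderedDict

class WindowedDeduper:
    """Drop replica copies of an event seen within a fixed time window.

    Assumes each CDC event can be given a stable event_id (e.g. the
    partition key plus writetime) so the RF copies hash to the same key.
    """

    def __init__(self, window_seconds=20 * 60):
        self.window = window_seconds
        # event_id -> first-seen timestamp; insertion order is age order,
        # since an id is only inserted the first time it is seen.
        self.seen = OrderedDict()

    def is_duplicate(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict ids older than the window, oldest first.
        while self.seen:
            oldest_id, first_seen = next(iter(self.seen.items()))
            if now - first_seen > self.window:
                self.seen.pop(oldest_id)
            else:
                break
        if event_id in self.seen:
            return True   # a replica copy already arrived: drop this one
        self.seen[event_id] = now
        return False      # first copy inside the window: keep it
```

With RF=3, the first copy of an event passes and the other two copies arriving within the window are dropped; a copy arriving after the window expires would be re-admitted, so the window must comfortably exceed the replicas' delivery skew.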