Hi,

Regarding the Cassandra CDC feature: http://cassandra.apache.org/doc/latest/operating/cdc.html
The CDC data is duplicated RF times. Say the replication factor is 3 in one DC; the same data will be sent out 3 times. One solution is to add another DC with RF=1 that is used only for CDC, so it carries no duplicated data. But a DC dedicated to that one job is very costly, and if any node in that DC goes down, there will be update lag.

Our pipeline pushes the data to Kafka and then ingests it into Hive. When ingesting into Hive, we could hold 20 minutes of data in memory and de-dup there. But Kafka would still store 3X the data, which is also costly.

Does anyone have a similar problem? What's your solution? Any feedback is welcome.

Thanks,
Jay
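For what it's worth, the 20-minute in-memory de-dup could look roughly like the sketch below. The `WindowedDeduper` class, the `window_seconds` parameter, and the idea of an `event_id` (e.g. partition key plus writetime) are my assumptions for illustration, not anything Cassandra CDC provides:

```python
import time
from collections import OrderedDict

class WindowedDeduper:
    """Drop replica copies of an event seen within a fixed time window.

    Assumes each CDC event can be given a stable event_id (e.g. the
    partition key plus writetime) so the RF copies hash to the same key.
    """

    def __init__(self, window_seconds=20 * 60):
        self.window = window_seconds
        # event_id -> first-seen timestamp; insertion order is age order,
        # since an id is only inserted the first time it is seen.
        self.seen = OrderedDict()

    def is_duplicate(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict ids older than the window, oldest first.
        while self.seen:
            oldest_id, first_seen = next(iter(self.seen.items()))
            if now - first_seen > self.window:
                self.seen.pop(oldest_id)
            else:
                break
        if event_id in self.seen:
            return True   # a replica copy already arrived: drop this one
        self.seen[event_id] = now
        return False      # first copy inside the window: keep it
```

With RF=3, the first copy of an event passes and the other two copies arriving within the window are dropped; a copy arriving after the window expires would be re-admitted, so the window must comfortably exceed the replicas' delivery skew.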