My understanding is that all CDC really is right now is a stable commit log reader. For a given mutation on an RF=3 system, you'll end up with 3 readers that all *could* do some action. For now let's just say the action is "put it in a Kafka topic", because that lets us do anything we want after that.
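To make the shape of that concrete, here's a minimal sketch of the per-node reader half. The Mutation type and readMutations() below are stand-ins for whatever the CDC reader API actually exposes (I haven't dug into it), and the topic name is made up; the producer side is just the standard Kafka client:

```java
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CdcToKafka {

    // Hypothetical stand-in for a decoded commit log entry; the real shape
    // would come from whatever the CDC reader API hands back.
    static final class Mutation {
        final String id;      // e.g. the {mutation_timestamp}.{mutation_hash} idea below
        final byte[] payload;
        Mutation(String id, byte[] payload) { this.id = id; this.payload = payload; }
    }

    // Stub -- stands in for tailing new CDC segments on the local node.
    static List<Mutation> readMutations() { return Collections.emptyList(); }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (Mutation m : readMutations()) {
                // Key by mutation id: all RF copies of the same mutation then
                // land in the same Kafka partition, which keeps any downstream
                // dedupe a local, per-partition problem.
                producer.send(new ProducerRecord<>("mutations_raw", m.id, m.payload));
            }
        }
    }
}
```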
I suppose the most straightforward way would be to have a daemon that runs on each Cassandra node whose job it is to see a mutation come through, then apply a lock via LWT in order to make a best effort at only-once semantics (but we know this is best effort and we're really going to get at-least-once). The node that acquired the lock (let's call it A) would be responsible for marking the mutation as complete. The other nodes (B & C) that failed at the LWT insert would be the backup nodes, ensuring that if node A died while pushing the mutation along they would be able to do it. I'm not sure how each mutation is identified at the moment - is there a mutation ID that gets sent to each replica? Without that you'd need to compute some hash, I suppose, and use that as the LWT key, but of course now you have to deal with potential collisions; maybe you'd use {mutation_timestamp}.{mutation_hash} and call it a day. The problem here is that we've now added a ton of Paxos rounds on top of every mutation, increasing the load on our Cassandra cluster by a significant amount.

The alternative, which might be more performant, is to keep the mutation state in memory, or in a local keyspace. This has additional issues, like what do we do when topology changes?

A third option is to assume you're going to get constant at-least-once (probably a whole bunch of times) and have all of your Kafka consumers deal with the fact that they'll be getting RF copies of a mutation, doing the dedupe on their end. Maybe you'd have a single topic for raw, unduplicated data, plus a consumer that keeps track of what it's seen and forwards on only a single copy (there's a rough sketch of that consumer at the end of this mail). The tradeoff here is that now you've got 3 copies of each mutation in the first topic, plus another copy that gets forwarded on, plus you may want a durable log of what's been seen since, you know, servers do restart sometimes.

I'll be completely honest, I don't have a good answer to any of this. Doing it at the Cassandra level in either of the first 2 scenarios above feels like a mess and a lot of work; I'd probably spend the most time working on option 3, aka using a Kafka consumer to dedupe and produce a new, clean topic. I pick this not because it shines as the most elegant or efficient option but because I feel like I'd be able to write it and have the best chance of getting it reasonably close to correct.

Let me make sure I'm being clear on this - I would approach the development of CDC the exact same way it's been built. A CL reader is a logical first step towards building a useful system and is flexible enough to build all sorts of designs on top of. I just want to point out that it's not as simple as "point it at HDFS and let it do its magic". There will be a significant investment of time, probably a lot more than just putting Kafka in front of Cassandra & HDFS.

Hopefully someone smarter than me can figure out how to make this work well.
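For what it's worth, here's roughly what that option-3 dedupe consumer could look like. The topic names are the made-up ones from the sketch above, and the seen-set is in memory, which is exactly the durability gap I mentioned - a real version would need to persist it (or at least a time-bounded window of it) across restarts:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DedupeForwarder {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "kafka1:9092");
        cprops.put("group.id", "cdc-dedupe");
        cprops.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "kafka1:9092");
        pprops.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // In-memory only: lost on restart and grows without bound, so this is
        // best effort. A durable version would persist a time-bounded window
        // of seen ids somewhere that survives restarts.
        Set<String> seen = new HashSet<>();

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, byte[]> producer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(Collections.singletonList("mutations_raw"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> r : records) {
                    // The key is the mutation id, so up to RF copies share it;
                    // forward the first copy we see and drop the rest.
                    if (seen.add(r.key())) {
                        producer.send(new ProducerRecord<>("mutations_clean", r.key(), r.value()));
                    }
                }
            }
        }
    }
}
```

Because the raw topic is keyed by mutation id, all RF copies of a mutation land in the same partition, so a single consumer instance is guaranteed to see every copy of the mutations it's responsible for.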
Jon

On Tue, Aug 9, 2016 at 11:27 AM Ryan Svihla <r...@foundev.pro> wrote:

> Jon,
>
> You know I've not actually spent the hour to read the ticket, so I was
> just guessing it didn't handle dedup... all the same semantics apply
> though. You'd have to do a read before write and then allow some window
> of failure mode. Maybe if you were to LWT everything, but that sounds
> really slow... I'd be curious of your thoughts on how to do that well...
> maybe I'm missing something.
>
> Regards,
> Ryan Svihla
>
> On Aug 9, 2016, 1:13 PM -0500, Jonathan Haddad <j...@jonhaddad.com>, wrote:
>
> I'm having a hard time seeing how anyone would be able to work with CDC
> in its current implementation of not doing any dedupe. Unless you really
> want to write all your own logic for that, including failure handling + a
> distributed state machine, I wouldn't count on it as a solution.
>
> On Tue, Aug 9, 2016 at 10:49 AM Ryan Svihla <r...@foundev.pro> wrote:
>
>> You can follow the monster of a ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-8844 and see if it
>> looks like the tradeoffs there are headed in the right direction for you.
>>
>> Even CDC I think would have logically the same issue of not deduping
>> for you as triggers and dual writes, due to replication factor and
>> consistency level issues. Otherwise you'd be stuck doing an all-replica
>> comparison when a late event came in, and when a node was down what
>> would you do then? What if one replica got it as well and then came
>> online much later? Even if you were using a single-source-of-truth style
>> database, you'll find failover has a way of losing late events anyway
>> (due to async replication), not to mention once you go multi-DC it's all
>> a matter of which DC you're in.
>>
>> Anyway, for the cold storage I think a trailing window that is just
>> greater than your oldest events would do it. I.e. if you choose to only
>> accept events up to 30 days out, then cold storage at 32 days. At some
>> point there is no free lunch, as you point out, when replicating between
>> two data sources; CDC, triggers, really anything that marks a "new
>> event" will have the same problem, and you'll have to choose an
>> acceptable level of lateness or check for lateness indefinitely.
>>
>> Alternatively you can just accept duplication and handle it on the cold
>> storage read side (like an event sourcing pattern; this would be ideal
>> if the lateness is uncommon), or clean it up over time in cold storage
>> as it's detected (similar to an event sourcing pattern, but snapshotting
>> data down to a single record when you encounter it on a read).
>>
>> Best of luck, this is a corner case that requires hard tradeoffs in all
>> technology I've encountered.
>>
>> Regards,
>> Ryan Svihla
>>
>> On Aug 9, 2016, 12:21 PM -0500, Ben Vogan <b...@shopkick.com>, wrote:
>>
>> Thanks Ryan. I was hoping there was a change data capture framework. We
>> have late arriving events, some of which can be very late. We would have
>> to batch collect data for a large time period every so often to go back
>> and collect those, or accept that we are going to lose a small
>> percentage of events. Neither of which is ideal.
>>
>> On Tue, Aug 9, 2016 at 10:30 AM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>>> The typical pattern I've seen in the field is Kafka + consumers for
>>> each destination (a variant of dual write, I know); this of course
>>> would not work for your goal of relying on C* for dedup. Triggers would
>>> also suffer the same problem, unfortunately, so you're really left with
>>> a batch job (most likely Spark) to move data from C* into HDFS on a
>>> given interval. If this is really a cold storage use case that can work
>>> quite well, especially assuming you've modeled your data as a time
>>> series or with some sort of time-based bucketing, so you can quickly
>>> get full partitions of data out of C* in a deterministic fashion and
>>> not have to scan your entire data set.
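To make that batch-job shape concrete, here is a minimal sketch using the spark-cassandra-connector's DataFrame support. The keyspace, table, and bucket column are invented for illustration, and it assumes the bucket is part of the partition key so the connector can push the filter down rather than scanning:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventsToHdfs {
    public static void main(String[] args) {
        // One bucket per scheduled run, e.g. "2016-08-09".
        String bucket = args[0];

        SparkSession spark = SparkSession.builder()
                .appName("cassandra-to-hdfs")
                .getOrCreate();

        // Read exactly one time bucket. If day_bucket is (part of) the
        // partition key, the connector pushes this filter down instead of
        // scanning the whole table.
        Dataset<Row> day = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "events")
                .option("table", "events_by_day")
                .load()
                .filter(col("day_bucket").equalTo(bucket));

        // Cassandra's primary key has already collapsed duplicate writes, so
        // this lands in HDFS as a deduped snapshot of the bucket.
        day.write()
                .mode("overwrite")
                .parquet("hdfs:///data/events/day_bucket=" + bucket);
    }
}
```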
>>> For similar needs I've also seen Spark streaming + querying Cassandra
>>> for duplication checks to dedup, then outputting to another source (a
>>> form of dual write, but with dedup); this was really silly and slow. I
>>> only bring it up to save you the trouble in case you end up down the
>>> same path chasing something more 'real time'.
>>>
>>> Regards,
>>> Ryan Svihla
>>>
>>> On Aug 9, 2016, 11:09 AM -0500, Ben Vogan <b...@shopkick.com>, wrote:
>>>
>>> Hi all,
>>>
>>> We are investigating using Cassandra in our data platform. We would
>>> like data to go into Cassandra first and to eventually be replicated
>>> into our data lake in HDFS for long-term cold storage. Does anyone know
>>> of a good way of doing this? We would rather not have parallel writes
>>> to HDFS and Cassandra, because we were hoping that we could use
>>> Cassandra primary keys to de-duplicate events.
>>>
>>> Thanks,
>>> --
>>> BENJAMIN VOGAN | Data Platform Team Lead
>>> shopkick <http://www.shopkick.com/>
>>> The indispensable app that rewards you for shopping.
>>
>> --
>> BENJAMIN VOGAN | Data Platform Team Lead
>> shopkick <http://www.shopkick.com/>
>> The indispensable app that rewards you for shopping.