Re Rahul: "Although DSE Advanced Replication does one-way replication, those are use cases with limited value to me because ultimately it's still a master-slave design." Completely agree. I'm not familiar with the Calvin protocol, but it sounds interesting (reading time...).
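Having skimmed it now: if I'm reading the paper right, the primitive Rahul describes below (assign the change a definitive position before it is written anywhere, then apply in that order everywhere) reduces to something like the following. A minimal sketch in Java, all class names hypothetical, with a single AtomicLong standing in for Calvin's replicated sequencing log:

import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** A change captured from either cluster, before it is applied anywhere. */
final class ProposedMutation {
    final String table;
    final String partitionKey;
    final byte[] payload;

    ProposedMutation(String table, String partitionKey, byte[] payload) {
        this.table = table;
        this.partitionKey = partitionKey;
        this.payload = payload;
    }
}

/** The same change with a globally agreed position, safe to apply everywhere. */
final class SequencedMutation {
    final long sequence;
    final ProposedMutation change;

    SequencedMutation(long sequence, ProposedMutation change) {
        this.sequence = sequence;
        this.change = change;
    }
}

/**
 * Stand-in for Calvin's sequencing layer. In Calvin the position comes from a
 * replicated, fault-tolerant log; the single AtomicLong here fakes that
 * agreement on one node.
 */
final class Sequencer {
    private final AtomicLong next = new AtomicLong();
    private final ConcurrentLinkedQueue<SequencedMutation> ledger =
            new ConcurrentLinkedQueue<>();

    /** Assign the definitive position BEFORE the write happens anywhere. */
    SequencedMutation sequence(ProposedMutation change) {
        SequencedMutation s = new SequencedMutation(next.getAndIncrement(), change);
        ledger.add(s); // the queue/ledger separate from each cluster's commit log
        return s;
    }

    /** A single applier drains the ledger in order and schedules each mutation. */
    void drainTo(List<SequencedMutation> applyLog) {
        for (SequencedMutation s; (s = ledger.poll()) != null; ) {
            applyLog.add(s); // in reality: schedule s on cluster A and cluster B
        }
    }
}

Making that AtomicLong survive node failures without becoming a bottleneck is, as DuyHai says below, the non-trivial part.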
On Tue, Sep 11, 2018 at 8:38 PM Joy Gao <j...@wepay.com> wrote:

> Thank you all for the feedback so far.
>
> The immediate use case for us is setting up a real-time streaming data pipeline from C* to our Data Warehouse (BigQuery), where other teams can access the data for reporting/analytics/ad-hoc query. We already do this with MySQL <https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka>, where we stream the MySQL binlog via Debezium <https://debezium.io>'s MySQL Connector to Kafka, and then use a BigQuery Sink Connector to stream the data to BigQuery.
>
> Re Jon's comment about why not write to Kafka first: in some cases that may be ideal, but one potential concern we have with writing to Kafka first is not having "read-after-write" consistency. The data could be written to Kafka but not yet consumed by C*. If the web service issues a (quorum) read immediately after the (quorum) write, the data returned could still be outdated if the consumer has not caught up. Having the web service interact with C* directly solves this problem for us (we could add a cache before writing to Kafka, but that adds operational complexity to the architecture; alternatively, we could write to Kafka and C* transactionally, but distributed transactions are slow).
>
> Having the ability to stream its data to other systems could make C* more flexible and more easily integrated into a larger data ecosystem. As Dinesh has mentioned, implementing this in the database layer means there is a standard approach to getting a change notification stream (unlike triggers, which are ad hoc and customized). Aside from replication, the change events could be used for updating Elasticsearch, generating derived views (e.g. for reporting), sending to an audit service, sending to a notification service, and in our case, streaming to our data warehouse for analytics. (One article that goes over database streaming is Martin Kleppmann's Turning the Database Inside Out with Apache Samza <https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/>, which seems relevant here.) For reference, turning the database into a stream of change events is pretty common in SQL databases (e.g. the MySQL binlog, the Postgres WAL) and in NoSQL databases with a primary-replica setup (e.g. the MongoDB oplog). Recently CockroachDB introduced a CDC feature as well (and they have masterless replication too).
>
> Hope that answers the question. That said, dedupe/ordering/getting the full row of data via C* CDC is a hard problem, but it may be worth solving for the reasons mentioned above. Our proposal is a user-level approach to solving these problems. Maybe the more sensible thing to do is to build it as part of C* itself, but that's a much bigger discussion. If anyone is building a streaming pipeline for C*, we'd be interested in hearing their approaches as well.
>
>
> On Tue, Sep 11, 2018 at 7:01 AM Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
>
>> You know what they say: Go big or go home.
>>
>> Right now the candidates are: Cassandra itself, but embedded or on the side, not on the actual data clusters; ZooKeeper (yuck); Kafka (which needs ZooKeeper, yuck); S3 (outside service dependency, so no go).
>>
>> Jeff, those are great patterns, esp. the second one. Have used it several times. Cassandra is a great place to store data in transport.
>>
>> Rahul
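Re Joy's read-after-write point above, to spell out why writing to C* first gets the guarantee: with RF=3, a QUORUM write and a QUORUM read overlap on at least one replica (W + R > RF), so the read cannot miss the write. A minimal sketch with the DataStax Java driver 4.x; the keyspace, table, and id are made up:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ReadAfterWrite {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // QUORUM write: acknowledged by at least 2 of 3 replicas (RF=3 assumed).
            session.execute(
                SimpleStatement.builder(
                        "UPDATE shop.payments SET state = 'CAPTURED' WHERE id = ?")
                    .addPositionalValue("payment-42")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                    .build());

            // QUORUM read: consults at least 2 of 3 replicas, so it must include
            // one that acknowledged the write above.
            Row row = session.execute(
                SimpleStatement.builder(
                        "SELECT state FROM shop.payments WHERE id = ?")
                    .addPositionalValue("payment-42")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                    .build())
                .one();
            System.out.println(row.getString("state")); // prints CAPTURED
        }
    }
}

The flip side, as Joy notes, is that the CDC consumer behind that write is now the piece that has to dedupe and order events before they reach Kafka.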
>> On Sep 10, 2018, 5:21 PM -0400, DuyHai Doan <doanduy...@gmail.com>, wrote:
>>
>> Also, using Calvin means having to implement a distributed monotonic sequence as a primitive, which is not trivial at all...
>>
>> On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
>>
>>> In response to mimicking Advanced Replication in DSE: I understand the goal. Although DSE Advanced Replication does one-way replication, those are use cases with limited value to me because ultimately it's still a master-slave design.
>>>
>>> I'm working on a prototype of two-way replication between clusters or databases, regardless of DB tech, and every variation I can get to comes down to some implementation of the Calvin protocol, which basically verifies the change in either cluster, sequences it according to its impact on the underlying data, and then schedules the mutation in a predictable manner on both clusters/DBs.
>>>
>>> All that means is that I need to sequence the change before it happens, so I can predictably ensure it is scheduled for write/mutation. So I'm back to square one: having a definitive queue/ledger separate from the individual commit log of each cluster.
>>>
>>>
>>> Rahul Singh
>>> Chief Executive Officer
>>> m 202.905.2818
>>>
>>> Anant Corporation
>>> 1010 Wisconsin Ave NW, Suite 250
>>> Washington, D.C. 20007
>>>
>>> We build and manage digital business technology platforms.
>>> On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi <dinesh.jo...@yahoo.com.invalid>, wrote:
>>>
>>> On Sep 9, 2018, at 6:08 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>> There may be some use cases for it, but I'm not sure what they are. It might help if you shared the use cases where the extra complexity is required. When is writing to Cassandra, which then dedupes and writes to Kafka, a preferred design to using Kafka and simply writing to Cassandra?
>>>
>>>
>>> From my reading of the proposal, it seems to bring functionality similar to a MySQL binlog-to-Kafka connector. This is useful for many applications that want to be notified when certain (or any) rows change in the database, primarily for an event-driven application architecture.
>>>
>>> Implementing this in the database layer means there is a standard approach to getting a change notification stream. Downstream subscribers can then decide which notifications to act on.
>>>
>>> LinkedIn's Databus is similar in functionality (https://github.com/linkedin/databus); however, it is for heterogeneous datastores.
>>>
>>> On Thu, Sep 6, 2018 at 1:53 PM Joy Gao <j...@wepay.com.invalid> wrote:
>>>
>>>> We have a WIP design doc <https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura> that goes over this idea in detail.
>>>>
>>>> We haven't sorted out all the edge cases yet, but we would love to get some feedback from the community on the general feasibility of this approach. Any ideas/concerns/questions would be helpful to us. Thanks!
>>>>
>>>
>>> Interesting idea. I did go over the proposal briefly. I concur with Jon about adding more use cases to clarify this feature's potential.
>>>
>>> Dinesh
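One more pointer for anyone experimenting with the consuming side: the server hook already exists. CDC segments land in the cdc_raw directory, and CommitLogReader/CommitLogReadHandler can parse them. A rough sketch, from my memory of the 3.11-era internals (signatures may have drifted, the path is an assumption, and loading the table schemas is elided):

import java.io.File;
import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.commitlog.CommitLogDescriptor;
import org.apache.cassandra.db.commitlog.CommitLogReadHandler;
import org.apache.cassandra.db.commitlog.CommitLogReader;
import org.apache.cassandra.db.partitions.PartitionUpdate;

public class CdcRawTailer implements CommitLogReadHandler {

    @Override
    public boolean shouldSkipSegmentOnError(CommitLogReadException e) {
        return false; // surface corruption rather than silently skipping
    }

    @Override
    public void handleUnrecoverableError(CommitLogReadException e) {
        throw new RuntimeException(e);
    }

    @Override
    public void handleMutation(Mutation m, int size, int entryLocation,
                               CommitLogDescriptor desc) {
        // A mutation can touch several tables in one keyspace; fan out per update.
        for (PartitionUpdate update : m.getPartitionUpdates()) {
            System.out.printf("keyspace=%s key=%s rows=%d%n",
                    m.getKeyspaceName(), update.partitionKey(), update.rowCount());
        }
    }

    public static void main(String[] args) throws Exception {
        // Offline deserialization also needs the table schemas loaded; elided here.
        DatabaseDescriptor.toolInitialization();
        CommitLogReader reader = new CommitLogReader();
        // Hypothetical path; in practice, watch cdc_raw and delete consumed segments.
        for (File segment : new File("/var/lib/cassandra/cdc_raw").listFiles()) {
            reader.readCommitLogSegment(new CdcRawTailer(), segment, false);
        }
    }
}

Note that handleMutation fires once per replica per mutation, which is exactly why the dedupe/ordering layer in Joy's design doc is needed.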