Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

Joy Gao Tue, 11 Sep 2018 20:57:33 -0700

Re Rahul:  "Although DSE advanced replication does one way, those are use
cases with limited value to me because ultimately it’s still a master slave
design."
Completely agree. I'm not familiar with Calvin protocol, but that sounds
interesting (reading time...).


On Tue, Sep 11, 2018 at 8:38 PM Joy Gao <j...@wepay.com> wrote:

> Thank you all for the feedback so far.
>
> The immediate use case for us is setting up a real-time streaming data
> pipeline from C* to our Data Warehouse (BigQuery), where other teams can
> access the data for reporting/analytics/ad-hoc query. We already do this
> with MySQL
> <https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka>,
> where we stream the MySQL Binlog via Debezium <https://debezium.io>'s
> MySQL Connector to Kafka, and then use a BigQuery Sink Connector to stream
> data to BigQuery.
>
> Re Jon's comment about why not write to Kafka first? In some cases that
> may be ideal; but one potential concern we have with writing to Kafka first
> is not having "read-after-write" consistency. The data could be written to
> Kafka, but not yet consumed by C*. If the web service issues a (quorum)
> read immediately after the (quorum) write, the data that is being returned
> could still be outdated if the consumer did not catch up. Having web
> service interacts with C* directly solves this problem for us (we could add
> a cache before writing to Kafka, but that adds additional operational
> complexity to the architecture; alternatively, we could write to Kafka and
> C* transactionally, but distributed transaction is slow).
>
> Having the ability to stream its data to other systems could make C* more
> flexible and more easily integrated into a larger data ecosystem. As Dinesh
> has mentioned, implementing this in the database layer means there is a
> standard approach to getting a change notification stream (unlike trigger
> which is ad-hoc and customized). Aside from replication, the change events
> could be used for updating Elasticsearch, generating derived views (i.e.
> for reporting), sending to an audit services, sending to a notification
> service, and in our case, streaming to our data warehouse for analytics.
> (one article that goes over database streaming is Martin Kleppman's Turning
> the Database Inside Out with Apache Samza
> <https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/>,
> which seems relevant here). For reference, this turning database into a
> stream of change events is pretty common in SQL databases (i.e. mysql
> binlog, postgres WAL) and NoSQL databases that have primary-replica setup
> (i.e. Mongodb Oplog). Recently CockroachDB introduced a CDC feature as well
> (and they have master-less replication too).
>
> Hope that answers the question. That said, dedupe/ordering/getting full
> row of data via C* CDC is a hard problem, but may be worth solving for
> reasons mentioned above. Our proposal is an user approach to solve these
> problems. Maybe the more sensible thing to do is to build it as part of C*
> itself, but that's a much bigger discussion. If anyone is building a
> streaming pipeline for C*, we'd be interested in hearing their approaches
> as well.
>
>
> On Tue, Sep 11, 2018 at 7:01 AM Rahul Singh <rahul.xavier.si...@gmail.com>
> wrote:
>
>> You know what they say: Go big or go home.
>>
>> Right now candidates are Cassandra itself but embedded or on the side not
>> on the actual data clusters, zookeeper (yuck) , Kafka (which needs
>> zookeeper, yuck) , S3 (outside service dependency, so no go. )
>>
>> Jeff, Those are great patterns. ESP. Second one. Have used it several
>> times. Cassandra is a great place to store data in transport.
>>
>>
>> Rahul
>> On Sep 10, 2018, 5:21 PM -0400, DuyHai Doan <doanduy...@gmail.com>,
>> wrote:
>>
>> Also using Calvin means having to implement a distributed monotonic
>> sequence as a primitive, not trivial at all ...
>>
>> On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh <
>> rahul.xavier.si...@gmail.com> wrote:
>>
>>> In response to mimicking Advanced replication in DSE. I understand the
>>> goal. Although DSE advanced replication does one way, those are use cases
>>> with limited value to me because ultimately it’s still a master slave
>>> design.
>>>
>>> I’m working on a prototype for this for two way replication between
>>> clusters or databases regardless of dB tech - and every variation I can get
>>> to comes down to some implementation of the Calvin protocol which basically
>>> verifies the change in either cluster , sequences it according to impact to
>>> underlying data, and then schedules the mutation in a predictable manner on
>>> both clusters / DBS.
>>>
>>> All that means is that I need to sequence the change before it happens
>>> so I can predictably ensure it’s Scheduled for write / Mutation. So I’m
>>> Back to square one: having a definitive queue / ledger separate from the
>>> individual commit log of the cluster.
>>>
>>>
>>> Rahul Singh
>>> Chief Executive Officer
>>> m 202.905.2818
>>>
>>> Anant Corporation
>>> 1010 Wisconsin Ave NW, Suite 250
>>> <https://maps.google.com/?q=1010+Wisconsin+Ave+NW,+Suite+250+%0D%0AWashington,+D.C.+20007&entry=gmail&source=g>
>>> Washington, D.C. 20007
>>>
>>> We build and manage digital business technology platforms.
>>> On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi 
>>> <dinesh.jo...@yahoo.com.invalid>,
>>> wrote:
>>>
>>> On Sep 9, 2018, at 6:08 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>> There may be some use cases for it.. but I'm not sure what they are.  It
>>> might help if you shared the use cases where the extra complexity is
>>> required?  When does writing to Cassandra which then dedupes and writes to
>>> Kafka a preferred design then using Kafka and simply writing to Cassandra?
>>>
>>>
>>> From the reading of the proposal, it seems bring functionality similar
>>> to MySQL's binlog to Kafka connector. This is useful for many applications
>>> that want to be notified when certain (or any) rows change in the database
>>> primarily for a event driven application architecture.
>>>
>>> Implementing this in the database layer means there is a standard
>>> approach to getting a change notification stream. Downstream subscribers
>>> can then decide which notifications to act on.
>>>
>>> LinkedIn's databus is similar in functionality -
>>> https://github.com/linkedin/databus However it is for heterogenous
>>> datastores.
>>>
>>> On Thu, Sep 6, 2018 at 1:53 PM Joy Gao <j...@wepay.com.invalid> wrote:
>>>
>>>>
>>>>
>>>> We have a* WIP design doc
>>>> <https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura>* that
>>>> goes over this idea in details.
>>>>
>>>> We haven't sort out all the edge cases yet, but would love to get some
>>>> feedback from the community on the general feasibility of this approach.
>>>> Any ideas/concerns/questions would be helpful to us. Thanks!
>>>>
>>>>
>>> Interesting idea. I did go over the proposal briefly. I concur with Jon
>>> about adding more use-cases to clarify this feature's potential use-cases.
>>>
>>> Dinesh
>>>
>>>
>>

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

Reply via email to