Very good points, Gwen. I hadn't thought of the Oracle Streams case of dependencies. I wonder if GoldenGate handles this better?
The tradeoff with these approaches is that each RDBMS is proprietary in how it exposes this CDC information. I guess GoldenGate could serve as a standard interface across RDBMSs, but there really isn't anything covering NoSQL stores like HBase, Cassandra, or Mongo. I wish we poor analytics guys had more say in which OLTP stores the application teams use :) LinkedIn solved this pretty well by having their teams use Espresso, which has a nice CDC pattern with a MySQL engine under the covers.
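To make the topic-per-table idea from the thread below a bit more concrete, here is a minimal sketch of publishing row changes keyed by primary key, so that all versions of a row stay in order within one partition while consumers still read partitions in parallel. The topic name, key, and payload are made up for illustration, and it assumes the plain Java producer with String serializers:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CdcTableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // One topic per table; the record key is the row's primary key, so
            // the default partitioner hashes it and every change to the same row
            // lands in the same partition, preserving per-row order while
            // consumers read the partitions in parallel.
            String topic = "cdc.orders";         // hypothetical table topic
            String primaryKey = "order-42";      // hypothetical primary key value
            String changeJson = "{\"op\":\"UPDATE\",\"status\":\"SHIPPED\"}";

            producer.send(new ProducerRecord<>(topic, primaryKey, changeJson));
            producer.flush();
            producer.close();
        }
    }

Per-key ordering within a partition is all the consumer side needs for the parallelism described below; ordering across tables within one transaction is the part Gwen is questioning.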
On Sun, May 10, 2015 at 12:48 AM, Gwen Shapira <gshap...@cloudera.com> wrote:

> Hi Jonathan,
>
> I agree we can have topic-per-table, but some transactions may span
> multiple tables and therefore will get applied partially out of order. I
> suspect this can be a consistency issue and create a state that is
> different from the state in the original database, but I don't have good
> proof of it.
>
> I know that Oracle Streams has a "Parallel Apply" feature where they
> figure out whether transactions have dependencies and apply them in
> parallel only if they don't. So it sounds like dependencies may be an
> issue.
>
> Planning to give this more thought :)
>
> Gwen
>
> On Fri, May 1, 2015 at 7:56 PM, Jonathan Hodges <hodg...@gmail.com> wrote:
>
> > Hi Gwen,
> >
> > As you said, I see Bottled Water and Sqoop managing slightly different
> > use cases, so I don't see this feature as a Sqoop killer. However, I did
> > have a question on your comment that the transaction log or CDC approach
> > will have problems with very large, very active databases.
> >
> > I get that you need to have a single producer that transmits the
> > transaction log changes to Kafka in order. However, on the consumer side
> > you can have a topic per table and then partition these topics by
> > primary key to achieve nice parallelism. So it seems the producer is the
> > potential bottleneck, but I imagine you can scale that appropriately
> > vertically and put the proper HA in place.
> >
> > Would love to hear your thoughts on this.
> >
> > Jonathan
> >
> >
> > On Thu, Apr 30, 2015 at 5:09 PM, Gwen Shapira <gshap...@cloudera.com>
> > wrote:
> >
> > > I feel a need to respond to the Sqoop-killer comment :)
> > >
> > > 1) Note that most databases have a single transaction log per DB, and
> > > in order to get the correct view of the DB, you need to read it in
> > > order (otherwise transactions will get messed up). This means you are
> > > limited to a single producer reading data from the log, writing it to
> > > a single partition, and having it read by a single consumer. If the
> > > database is very large and very active, you may run into some issues
> > > there...
> > >
> > > Because Sqoop doesn't try to catch up with all the changes, but takes
> > > a snapshot (from multiple mappers in parallel), we can very rapidly
> > > Sqoop 10TB databases.
> > >
> > > 2) If HDFS is the target of getting data from Postgres, then
> > > postgresql -> kafka -> HDFS seems less optimal than postgresql -> HDFS
> > > directly (in parallel). There are good reasons to get Postgres data to
> > > Kafka, but if the eventual goal is HDFS (or HBase), I suspect Sqoop
> > > still has a place.
> > >
> > > 3) Due to its parallelism and general-purpose JDBC connector, I
> > > suspect that Sqoop is even a very viable way of getting data into
> > > Kafka.
> > >
> > > Gwen
> > >
> > >
> > > On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <jan.filip...@trivago.com>
> > > wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I am quite excited about the recent example of replicating PostgreSQL
> > > > changes to Kafka. My view on the log compaction feature had always
> > > > been a very sceptical one, but now, with its great potential exposed
> > > > to the wider public, I think it's an awesome feature. Especially when
> > > > pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop
> > > > killer. So I want to thank everyone who had the vision of building
> > > > these kinds of systems during a time when I could not imagine them.
> > > >
> > > > There is one open question that I would like people to help me with.
> > > > When pulling a snapshot of a partition into HDFS using a Camus-like
> > > > application, I feel the need to keep a set of all keys read so far
> > > > and stop as soon as I find a key that is already in my set. I use
> > > > this as an indicator of how far the log compaction has already
> > > > happened and only pull up to this point. This works quite well, as I
> > > > do not need to keep the messages in memory, only their keys.
> > > >
> > > > The question I want to raise with the community is:
> > > >
> > > > How do you prevent pulling the same record twice (in different
> > > > versions), and would it be beneficial if the "OffsetResponse" also
> > > > returned the last offset that has been compacted so far, so the
> > > > application would just pull up to this point?
> > > >
> > > > Looking forward to your recommendations and comments.
> > > >
> > > > Best
> > > > Jan
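For what it's worth, Jan's stop-at-the-first-duplicate-key heuristic could look roughly like the sketch below. This is only an illustration of the idea as he describes it (not how Camus actually does it); the topic and partition are made up, and it assumes a recent Java consumer API:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class CompactedSnapshotReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            // Hypothetical compacted topic; compaction (and this heuristic) is
            // per partition, so read a single partition from the beginning.
            TopicPartition tp = new TopicPartition("cdc.orders", 0);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(tp));
                consumer.seekToBeginning(Collections.singletonList(tp));

                Set<String> seenKeys = new HashSet<>();   // keys only, not values
                boolean done = false;
                while (!done) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (!seenKeys.add(record.key())) {
                            // A repeated key means we have left the fully
                            // compacted head of the log, so stop the snapshot.
                            done = true;
                            break;
                        }
                        // ...write record.value() to HDFS (omitted here)...
                    }
                    // A real job would also stop at the end offset captured at
                    // startup, in case no duplicate key ever shows up.
                }
            }
        }
    }

Whether the broker could instead report the last-compacted offset directly, as Jan asks, is exactly the open question.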