Hi Gwen,

As you said, I see Bottled Water and Sqoop addressing slightly different
use cases, so I don't see this feature as a Sqoop killer. However, I did
have a question about your comment that the transaction log (CDC)
approach will have problems with very large, very active databases.
I get that you need a single producer that transmits the transaction log
changes to Kafka in order. However, on the consumer side you can have a
topic per table and then partition those topics by primary key to achieve
nice parallelism (a quick sketch of what I mean follows below the quote).
So the producer seems to be the potential bottleneck, but I imagine you
can scale it vertically as needed and put proper HA in place. Would love
to hear your thoughts on this.

Jonathan

On Thu, Apr 30, 2015 at 5:09 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> I feel a need to respond to the Sqoop-killer comment :)
>
> 1) Note that most databases have a single transaction log per DB, and in
> order to get a correct view of the DB, you need to read it in order
> (otherwise transactions will get mixed up). This means you are limited
> to a single producer reading data from the log, writing it to a single
> partition, and having it read by a single consumer. If the database is
> very large and very active, you may run into some issues there...
>
> Because Sqoop doesn't try to catch up with all the changes, but takes a
> snapshot (from multiple mappers in parallel), we can very rapidly Sqoop
> 10TB databases.
>
> 2) If HDFS is the target for data coming out of Postgres, then
> postgresql -> kafka -> HDFS seems less optimal than postgresql -> HDFS
> directly (in parallel). There are good reasons to get Postgres data into
> Kafka, but if the eventual goal is HDFS (or HBase), I suspect Sqoop
> still has a place.
>
> 3) Thanks to its parallelism and general-purpose JDBC connector, I
> suspect Sqoop is actually a very viable way of getting data into Kafka.
>
> Gwen
>
>
> On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <jan.filip...@trivago.com>
> wrote:
>
> > Hello everyone,
> >
> > I am quite excited about the recent example of replicating PostgreSQL
> > changes to Kafka. My view of the log compaction feature had always
> > been a very sceptical one, but now, with its great potential exposed
> > to the wider public, I think it's an awesome feature. Especially when
> > pulling this data into HDFS as a snapshot, it is (IMO) a Sqoop killer.
> > So I want to thank everyone who had the vision of building these kinds
> > of systems during a time when I could not imagine them.
> >
> > There is one open question that I would like people to help me with.
> > When pulling a snapshot of a partition into HDFS using a Camus-like
> > application, I feel the need to keep a set of all keys read so far and
> > stop as soon as I find a key that is already in my set. I use this as
> > an indicator of how far log compaction has already happened, and only
> > pull up to this point. This works quite well, as I only need to keep
> > the keys, not the messages, in memory.
> >
> > The question I want to raise with the community is:
> >
> > How do you prevent pulling the same record twice (in different
> > versions), and would it be beneficial if the "OffsetResponse" also
> > returned the last offset compacted so far, so that the application
> > could just pull up to that point?
> >
> > Looking forward to your recommendations and comments.
> >
> > Best,
> > Jan
> >
>
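
P.S. To make the consumer-side parallelism concrete, here is a minimal
sketch of the producer half using Kafka's Java client (the topic naming,
key, and payload are made up for illustration; this is not Bottled
Water's actual code). Each table gets its own topic, and every change
message carries the row's primary key as the message key, so the default
partitioner keeps all versions of a row in one partition, in order, while
different rows spread across partitions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer =
                new KafkaProducer<>(props)) {
            // One topic per table; the row's primary key is the message
            // key. The default partitioner hashes the key, so all changes
            // to the same row land in the same partition, in order, while
            // different rows spread across partitions for consumer-side
            // parallelism.
            String table = "users";       // hypothetical table name
            String primaryKey = "42";     // hypothetical row PK
            String changeJson = "{\"id\":42,\"email\":\"a@b.com\"}";
            producer.send(
                new ProducerRecord<>("db." + table, primaryKey, changeJson));
        }
    }
}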
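
And for Jan's snapshot question, the loop I picture is roughly the
following — a sketch against the newer Java consumer API, with the HDFS
write stubbed out. For brevity it tracks a single key set; since
compaction and ordering are per-partition, in practice you would probably
keep one set per partition and stop each partition independently:

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SnapshotPuller {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "snapshot-puller");   // hypothetical group
        props.put("auto.offset.reset", "earliest"); // read from log head
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Set<String> seenKeys = new HashSet<>();
        try (KafkaConsumer<String, String> consumer =
                new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("db.users"));
            boolean done = false;
            while (!done) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // A repeated key means we have read past the fully
                    // compacted prefix of the log: stop the snapshot here.
                    if (!seenKeys.add(record.key())) {
                        done = true;
                        break;
                    }
                    writeToHdfs(record);  // hypothetical Camus-like sink
                }
            }
        }
    }

    private static void writeToHdfs(ConsumerRecord<String, String> record) {
        // Placeholder for the actual HDFS write.
        System.out.printf("%s -> %s%n", record.key(), record.value());
    }
}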