I feel a need to respond to the Sqoop-killer comment :)

1) Note that most databases have a single transaction log per database, and in order to get a correct view of the DB you need to read it in order (otherwise transactions will get messed up). This means you are limited to a single producer reading from the log, writing to a single partition, and having it read by a single consumer. If the database is very large and very active, you may run into some issues there...
Because Sqoop doesn't try to catch up with all the changes but instead takes a snapshot (from multiple mappers in parallel), we can very rapidly Sqoop 10TB databases.

2) If HDFS is the eventual target for the Postgres data, then postgresql -> kafka -> HDFS seems less optimal than postgresql -> HDFS directly (in parallel). There are good reasons to get Postgres data into Kafka, but if the eventual goal is HDFS (or HBase), I suspect Sqoop still has a place.

3) Due to its parallelism and general-purpose JDBC connector, I suspect that Sqoop is even a very viable way of getting data into Kafka (a sample import command is sketched below Jan's mail).

Gwen

On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <jan.filip...@trivago.com> wrote:
> Hello Everyone,
>
> I am quite excited about the recent example of replicating PostgreSQL
> changes to Kafka. My view of the log compaction feature had always been a
> very sceptical one, but now, with its great potential exposed to the wider
> public, I think it's an awesome feature. Especially when pulling this data
> into HDFS as a snapshot, it is (IMO) a Sqoop killer. So I want to thank
> everyone who had the vision of building these kinds of systems during a
> time when I could not imagine them.
>
> There is one open question that I would like people to help me with. When
> pulling a snapshot of a partition into HDFS using a Camus-like application,
> I feel the need to keep a set of all keys read so far and to stop as soon
> as I find a key that is already in my set. I use this as an indicator of
> how far log compaction has already progressed, and I only pull up to this
> point. This works quite well, as I do not need to keep the messages in
> memory, only their keys.
>
> The question I want to raise with the community is:
>
> How do you prevent pulling the same record twice (in different versions),
> and would it be beneficial if the "OffsetResponse" also returned the last
> offset that has been compacted so far, so that the application would just
> pull up to that point?
>
> Looking forward to your recommendations and comments.
>
> Best,
> Jan
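To make the Sqoop points above concrete: a parallel import from Postgres straight to HDFS is a single command along these lines (host, database, table, split column and target directory are placeholders; tune --num-mappers to the parallelism the database can handle):

  sqoop import \
    --connect jdbc:postgresql://dbhost/mydb \
    --username repl_user \
    --table orders \
    --split-by id \
    --num-mappers 8 \
    --target-dir /data/snapshots/orders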
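And on Jan's open question: below is a minimal sketch of the duplicate-key stopping rule he describes, written against the Kafka Java consumer API. The topic name, partition, bootstrap server and String serialization are assumptions, and a real job would hand records to an HDFS writer rather than stdout. It also stops at the end offset captured at startup, in case the partition is fully compacted and no key ever repeats:

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CompactedSnapshotPuller {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("enable.auto.commit", "false");           // snapshot job, no offsets to commit
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // "pg.changelog" / partition 0 stand in for the compacted CDC topic.
        TopicPartition tp = new TopicPartition("pg.changelog", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));

            // If no key ever repeats (fully compacted partition), stop at the
            // end offset observed when the job started.
            long endOffset = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            // Keys seen so far; the first repeated key marks the boundary
            // between the compacted tail and the not-yet-compacted head.
            Set<String> seenKeys = new HashSet<>();
            boolean duplicateFound = false;

            while (!duplicateFound && consumer.position(tp) < endOffset) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (!seenKeys.add(record.key())) {
                        duplicateFound = true;   // same key seen twice: stop the snapshot here
                        break;
                    }
                    // A real job would hand the record to an HDFS writer here.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}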