Hey Josh,

Transactions can be applied in parallel on the consumer side based on transaction dependency checking.
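A minimal sketch of that idea in Python (a hypothetical illustration, not IBM's actual implementation): transactions whose row-key hash sets are disjoint can go into the same parallel batch, and a hash collision merely creates a false dependency that over-serializes work without breaking correctness.

```python
def hash_key(key, buckets=1024):
    # Coarse bucket hash; collisions create false dependencies,
    # which serialize unnecessarily but never violate ordering.
    return hash(key) % buckets

def schedule(transactions):
    """Group an ordered list of (txn_id, row_keys) pairs into batches
    of mutually independent transactions; each batch can be applied
    in parallel, and batches are applied in sequence."""
    batches = []  # each entry: {"ids": [...], "hashes": set()}
    for txn_id, keys in transactions:
        hashes = {hash_key(k) for k in keys}
        # A txn must run after the last earlier batch it conflicts with.
        last_conflict = -1
        for i, batch in enumerate(batches):
            if batch["hashes"] & hashes:
                last_conflict = i
        target = last_conflict + 1
        if target == len(batches):
            batches.append({"ids": [], "hashes": set()})
        batches[target]["ids"].append(txn_id)
        batches[target]["hashes"] |= hashes
    return [b["ids"] for b in batches]

# Txns 1, 2, 4 touch disjoint rows and form one parallel batch;
# txn 3 shares row 101 with txn 1, so it waits for the next batch.
print(schedule([(1, {101}), (2, {202}), (3, {101, 303}), (4, {404})]))
# -> [[1, 2, 4], [3]]
```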
http://www.google.com.ar/patents/US20080163222

This patent documents how it works. It is easy to understand; however, you also need to consider hash collision issues. This has been implemented in IBM Q Replication since 2001.

Thanks,
Xiao Li

On Mar 3, 2015, at 3:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey Josh,
>
> As you say, ordering is per partition. Technically it is generally
> possible to publish all changes to a database to a single
> partition--generally the Kafka partition should be high-throughput
> enough to keep up. However, there are a couple of downsides to this:
>
> 1. Consumer parallelism is limited to one. If you want a total order
> to the consumption of messages, you need to have just one process, but
> often you would want to parallelize.
> 2. Often what people want is not a full stream of all changes in all
> tables in a database, but rather the changes to a particular table.
>
> To some extent the best way to do this depends on what you will do
> with the data. However if you intend to have lots
>
> I have seen pretty much every variation on this in the wild, and here
> is what I would recommend:
>
> 1. Have a single publisher process that publishes events into Kafka.
> 2. If possible, use the database log to get these changes (e.g. MySQL
> binlog, Oracle XStreams, GoldenGate, etc.). This will be more complete
> and more efficient than polling for changes, though that can work too.
> 3. Publish each table to its own topic.
> 4. Partition each topic by the primary key of the table.
> 5. Include in each message the database's transaction id, SCN, or
> other identifier that gives the total order within the record stream.
> Since there is a single publisher, this id will be monotonic within
> each partition.
>
> This seems to be the best set of tradeoffs for most use cases:
>
> - You can have parallel consumers, up to the number of partitions you
> chose, that still get messages in order per ID'd entity.
> - You can subscribe to just one table if you like, or to multiple
> tables.
> - Consumers who need a total order over all updates can do a "merge"
> across the partitions to reassemble the fully ordered set of changes
> across all tables/partitions.
>
> One thing to note is that the requirement of having a single consumer
> process/thread to get the total order isn't really so much a Kafka
> restriction as a restriction about the world, since if you had
> multiple threads, then even if you delivered messages to them in
> order, their processing might happen out of order (just due to the
> random timing of the processing).
>
> -Jay
>
> On Tue, Mar 3, 2015 at 3:15 PM, Josh Rader <jrader...@gmail.com> wrote:
>
>> Hi Kafka Experts,
>>
>> We have a use case around RDBMS replication where we are
>> investigating Kafka. In this case ordering is very important. Our
>> understanding is that ordering is only preserved within a single
>> partition. This makes sense, as a single thread will consume these
>> messages, but our question is: can we somehow parallelize this for
>> better performance? Is there maybe some partition-key strategy trick
>> to have your cake and eat it too, in terms of keeping ordering while
>> also being able to parallelize the processing?
>>
>> I am sorry if this has already been asked, but we tried to search
>> through the archives and couldn’t find this response.
>>
>> Thanks,
>> Josh
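[Editor's note: the pattern Jay recommends above -- a single publisher, topics partitioned by primary key, a monotonic transaction id in each message, and a consumer-side merge to recover total order -- can be sketched as follows. This is a hypothetical in-process simulation, not real Kafka client code: `partition_for` stands in for the producer's key partitioner (the Java client's default hashes the key bytes with murmur2), and plain lists stand in for topic partitions.]

```python
import heapq

NUM_PARTITIONS = 3

def partition_for(primary_key):
    # Stand-in for Kafka's key partitioner: the same key always lands
    # in the same partition, so per-row ordering is preserved.
    return hash(primary_key) % NUM_PARTITIONS

def publish(changes):
    """Single publisher: stamp each change with a monotonically
    increasing transaction id and route it by primary key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for txn_id, (pk, op) in enumerate(changes):
        partitions[partition_for(pk)].append((txn_id, pk, op))
    return partitions  # txn ids are monotonic within each partition

def total_order(partitions):
    """Consumer-side "merge": each partition is already sorted by
    txn id, so a k-way merge reassembles the global change order."""
    return list(heapq.merge(*partitions))

changes = [("k1", "insert"), ("k2", "insert"), ("k1", "update"),
           ("k3", "insert"), ("k2", "delete")]
parts = publish(changes)
# Parallel consumers read individual partitions, each in order per
# key; a consumer needing the total order merges them back together:
print([txn_id for txn_id, _, _ in total_order(parts)])  # -> [0, 1, 2, 3, 4]
```

Note that Python salts `hash()` for strings per process, so which partition a given key maps to varies between runs; the property that matters (one key, one partition, ids monotonic within it) holds regardless.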