Hey Josh,

Transactions can be applied in parallel on the consumer side based on transaction dependency checking.
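A minimal sketch of that idea in Python (a hypothetical illustration, not IBM's actual implementation): transactions whose row-key hash sets are disjoint can go into the same parallel batch, and a hash collision merely creates a false dependency that over-serializes work without breaking correctness.

```python
def hash_key(key, buckets=1024):
    # Coarse bucket hash; collisions create false dependencies,
    # which serialize unnecessarily but never violate ordering.
    return hash(key) % buckets

def schedule(transactions):
    """Group an ordered list of (txn_id, row_keys) pairs into batches
    of mutually independent transactions; each batch can be applied
    in parallel, and batches are applied in sequence."""
    batches = []  # each entry: {"ids": [...], "hashes": set()}
    for txn_id, keys in transactions:
        hashes = {hash_key(k) for k in keys}
        # A txn must run after the last earlier batch it conflicts with.
        last_conflict = -1
        for i, batch in enumerate(batches):
            if batch["hashes"] & hashes:
                last_conflict = i
        target = last_conflict + 1
        if target == len(batches):
            batches.append({"ids": [], "hashes": set()})
        batches[target]["ids"].append(txn_id)
        batches[target]["hashes"] |= hashes
    return [b["ids"] for b in batches]

# Txns 1, 2, 4 touch disjoint rows and form one parallel batch;
# txn 3 shares row 101 with txn 1, so it waits for the next batch.
print(schedule([(1, {101}), (2, {202}), (3, {101, 303}), (4, {404})]))
# -> [[1, 2, 4], [3]]
```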
http://www.google.com.ar/patents/US20080163222

This patent documents how it works. It is easy to understand; however, you also need to consider hash collision issues. This has been implemented in IBM Q Replication since 2001.

Thanks,
Xiao Li

On Mar 3, 2015, at 3:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey Josh,
>
> As you say, ordering is per partition. Technically it is generally
> possible to publish all changes to a database to a single
> partition--generally the Kafka partition should be high-throughput
> enough to keep up. However, there are a couple of downsides to this:
>
> 1. Consumer parallelism is limited to one. If you want a total order
> to the consumption of messages, you need to have just one process, but
> often you would want to parallelize.
> 2. Often what people want is not a full stream of all changes in all
> tables in a database, but rather the changes to a particular table.
>
> To some extent the best way to do this depends on what you will do
> with the data. However if you intend to have lots
>
> I have seen pretty much every variation on this in the wild, and here
> is what I would recommend:
>
> 1. Have a single publisher process that publishes events into Kafka.
> 2. If possible, use the database log to get these changes (e.g. MySQL
> binlog, Oracle XStreams, GoldenGate, etc.). This will be more complete
> and more efficient than polling for changes, though that can work too.
> 3. Publish each table to its own topic.
> 4. Partition each topic by the primary key of the table.
> 5. Include in each message the database's transaction id, SCN, or
> other identifier that gives the total order within the record stream.
> Since there is a single publisher, this id will be monotonic within
> each partition.
>
> This seems to be the best set of tradeoffs for most use cases:
>
> - You can have parallel consumers, up to the number of partitions you
> chose, that still get messages in order per ID'd entity.
> - You can subscribe to just one table if you like, or to multiple
> tables.
> - Consumers who need a total order over all updates can do a "merge"
> across the partitions to reassemble the fully ordered set of changes
> across all tables/partitions.
>
> One thing to note is that the requirement of having a single consumer
> process/thread to get the total order isn't really so much a Kafka
> restriction as a restriction about the world, since if you had
> multiple threads, then even if you delivered messages to them in
> order, their processing might happen out of order (just due to the
> random timing of the processing).
>
> -Jay
>
> On Tue, Mar 3, 2015 at 3:15 PM, Josh Rader <jrader...@gmail.com> wrote:
>
>> Hi Kafka Experts,
>>
>> We have a use case around RDBMS replication where we are
>> investigating Kafka. In this case ordering is very important. Our
>> understanding is that ordering is only preserved within a single
>> partition. This makes sense, as a single thread will consume these
>> messages, but our question is: can we somehow parallelize this for
>> better performance? Is there maybe some partition-key strategy trick
>> to have your cake and eat it too, in terms of keeping ordering while
>> also being able to parallelize the processing?
>>
>> I am sorry if this has already been asked, but we tried to search
>> through the archives and couldn’t find this response.
>>
>> Thanks,
>> Josh
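[Editor's note: the pattern Jay recommends above -- a single publisher, topics partitioned by primary key, a monotonic transaction id in each message, and a consumer-side merge to recover total order -- can be sketched as follows. This is a hypothetical in-process simulation, not real Kafka client code: `partition_for` stands in for the producer's key partitioner (the Java client's default hashes the key bytes with murmur2), and plain lists stand in for topic partitions.]

```python
import heapq

NUM_PARTITIONS = 3

def partition_for(primary_key):
    # Stand-in for Kafka's key partitioner: the same key always lands
    # in the same partition, so per-row ordering is preserved.
    return hash(primary_key) % NUM_PARTITIONS

def publish(changes):
    """Single publisher: stamp each change with a monotonically
    increasing transaction id and route it by primary key."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for txn_id, (pk, op) in enumerate(changes):
        partitions[partition_for(pk)].append((txn_id, pk, op))
    return partitions  # txn ids are monotonic within each partition

def total_order(partitions):
    """Consumer-side "merge": each partition is already sorted by
    txn id, so a k-way merge reassembles the global change order."""
    return list(heapq.merge(*partitions))

changes = [("k1", "insert"), ("k2", "insert"), ("k1", "update"),
           ("k3", "insert"), ("k2", "delete")]
parts = publish(changes)
# Parallel consumers read individual partitions, each in order per
# key; a consumer needing the total order merges them back together:
print([txn_id for txn_id, _, _ in total_order(parts)])  # -> [0, 1, 2, 3, 4]
```

Note that Python salts `hash()` for strings per process, so which partition a given key maps to varies between runs; the property that matters (one key, one partition, ids monotonic within it) holds regardless.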