Re: Database Replication Question

Jay Kreps Tue, 03 Mar 2015 15:37:36 -0800

Hey Josh,

As you say, ordering is per partition. Technically it is generally possible
to publish all changes to a database to a single partition--generally the
kafka partition should be high throughput enough to keep up. However there
are a couple of downsides to this:
1. Consumer parallelism is limited to one. If you want a total order to the
consumption of messages you need to have just 1 process, but often you
would want to parallelize.
2. Often what people want is not a full stream of all changes in all tables
in a database but rather the changes to a particular table.

To some extent the best way to do this depends on what you will do with the
data. However if you intend to have lots

I have seen pretty much every variation on this in the wild, and here is
what I would recommend:
1. Have a single publisher process that publishes events into Kafka
2. If possible use the database log to get these changes (e.g. mysql
binlog, Oracle xstreams, golden gate, etc). This will be more complete and
more efficient than polling for changes, though that can work too.
3. Publish each table to its own topic.
4. Partition each topic by the primary key of the table.
5. Include in each message the database's transaction id, scn, or other
identifier that gives the total order within the record stream. Since there
is a single publisher this id will be monotonic within each partition.

This seems to be the best set of tradeoffs for most use cases:
- You can have parallel consumers up to the number of partitions you chose
that still get messages in order per ID'd entity.
- You can subscribe to just one table if you like, or to multiple tables.
- Consumers who need a total order over all updates can do a "merge" across
the partitions to reassemble the fully ordered set of changes across all
tables/partitions.

One thing to note is that the requirement of having a single consumer
process/thread to get the total order isn't really so much a Kafka
restriction as it just is a restriction about the world, since if you had
multiple threads even if you delivered messages to them in order their
processing might happen out of order (just do to the random timing of the
processing).

-Jay

On Tue, Mar 3, 2015 at 3:15 PM, Josh Rader <jrader...@gmail.com> wrote:

> Hi Kafka Experts,
>
>
>
> We have a use case around RDBMS replication where we are investigating
> Kafka.  In this case ordering is very important.  Our understanding is
> ordering is only preserved within a single partition.  This makes sense as
> a single thread will consume these messages, but our question is can we
> somehow parallelize this for better performance?   Is there maybe some
> partition key strategy trick to have your cake and eat it too in terms of
> keeping ordering, but also able to parallelize the processing?
>
>
>
> I am sorry if this has already been asked, but we tried to search through
> the archives and couldn’t find this response.
>
>
>
> Thanks,
>
> Josh
>

Re: Database Replication Question

Reply via email to