Re: Database Replication Question

James Cheng Wed, 04 Mar 2015 23:42:06 -0800

> On Mar 3, 2015, at 4:18 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> 
> In addition to Jay's recommendation, you also need to take special care
> with error handling in the producer in order to preserve ordering, since
> the producer uses batching and async sending. That is, if you have already
> handed messages 1,2,3,4,5 to the producer but are later notified that
> message 3 failed to send, you need to avoid sending messages 4,5 until 3
> is fixed or dropped.
>


Guozhang, how would we do this? Would this require sending each message 
individually and waiting for acknowledgment of each message?

Send 1
Wait for ack
Send 2
Wait for ack
etc

If I try to send 1,2,3,4,5 in a batch, is it possible that the broker could 
receive 1,2 and 4,5, and that only 3 would fail? Or is it always a contiguous 
chunk, and then the first failure would cause the rest of the batch to abort?
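
For concreteness, the "send, wait for ack, send the next" loop above could be sketched roughly as follows. This is only an illustration, not Kafka client code: `send_and_wait` is a hypothetical callable standing in for a blocking send that returns True once the broker acknowledges the message, and `max_retries` is likewise an assumed knob.

```python
from typing import Callable, Iterable, List


def send_in_order(messages: Iterable[str],
                  send_and_wait: Callable[[str], bool],
                  max_retries: int = 3) -> List[str]:
    """Send messages one at a time, never moving past an unacknowledged one.

    send_and_wait is a hypothetical blocking call: True means the broker
    acknowledged the message, False means the send failed.
    Returns the acknowledged messages, in the order they were sent.
    """
    acked: List[str] = []
    for msg in messages:
        for _attempt in range(1 + max_retries):
            if send_and_wait(msg):
                acked.append(msg)
                break
        else:
            # Every attempt failed: stop here so that later messages
            # cannot overtake the failed one.
            return acked
    return acked
```

This keeps ordering at the cost of one round trip per message, which is exactly the throughput concern behind the question.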

-James

> Guozhang
> 
> On Tue, Mar 3, 2015 at 3:45 PM, Xiao <lixiao1...@gmail.com> wrote:
> 
>> Hey Josh,
>> 
>> Transactions can be applied in parallel in the consumer side based on
>> transaction dependency checking.
>> 
>> http://www.google.com.ar/patents/US20080163222
>> 
>> This patent documents how it works. It is easy to understand; however, you
>> also need to consider the hash collision issues. This has been implemented
>> in IBM Q Replication since 2001.
>> 
>> Thanks,
>> 
>> Xiao Li
>> 
>> 
>> On Mar 3, 2015, at 3:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>> 
>>> Hey Josh,
>>> 
>>> As you say, ordering is per partition. Technically it is generally
>>> possible to publish all changes to a database to a single partition--
>>> generally the Kafka partition should be high throughput enough to keep
>>> up. However there are a couple of downsides to this:
>>> 1. Consumer parallelism is limited to one. If you want a total order to
>>> the consumption of messages you need to have just 1 process, but often
>>> you would want to parallelize.
>>> 2. Often what people want is not a full stream of all changes in all
>>> tables in a database but rather the changes to a particular table.
>>> 
>>> To some extent the best way to do this depends on what you will do with
>>> the data. However if you intend to have lots
>>> 
>>> I have seen pretty much every variation on this in the wild, and here is
>>> what I would recommend:
>>> 1. Have a single publisher process that publishes events into Kafka
>>> 2. If possible use the database log to get these changes (e.g. MySQL
>>> binlog, Oracle XStream, GoldenGate, etc). This will be more complete
>>> and more efficient than polling for changes, though that can work too.
>>> 3. Publish each table to its own topic.
>>> 4. Partition each topic by the primary key of the table.
>>> 5. Include in each message the database's transaction id, SCN, or other
>>> identifier that gives the total order within the record stream. Since
>>> there is a single publisher this id will be monotonic within each
>>> partition.
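
A rough sketch of steps 3 to 5 above, assuming a single publisher that sees changes in commit order. Everything here is illustrative: `partition_for` and `route_changes` are made-up names, and CRC32 is only a stand-in for whatever hash the real partitioner applies to the key.

```python
import zlib
from typing import Dict, List, Tuple

# (txn_id, table, primary_key, payload), in commit order from the database log
Change = Tuple[int, str, str, str]


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a primary key to a partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


def route_changes(changes: List[Change],
                  num_partitions: int) -> Dict[Tuple[str, int], List[Tuple[int, str, str]]]:
    """Group changes by (topic, partition), one topic per table.

    Each routed message carries the transaction id so consumers can
    recover the total order later.
    """
    routed: Dict[Tuple[str, int], List[Tuple[int, str, str]]] = {}
    for txn_id, table, primary_key, payload in changes:
        topic = table  # step 3: one topic per table
        partition = partition_for(primary_key, num_partitions)  # step 4
        routed.setdefault((topic, partition), []).append(
            (txn_id, primary_key, payload))  # step 5: keep the txn id
    return routed
```

Because the input is already in commit order and there is only one publisher, the transaction ids within each (topic, partition) list come out non-decreasing, and all changes to a given row land in the same partition.
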
>>> 
>>> This seems to be the best set of tradeoffs for most use cases:
>>> - You can have parallel consumers up to the number of partitions you
>>> chose that still get messages in order per ID'd entity.
>>> - You can subscribe to just one table if you like, or to multiple tables.
>>> - Consumers who need a total order over all updates can do a "merge"
>>> across the partitions to reassemble the fully ordered set of changes
>>> across all tables/partitions.
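
The "merge" in the last bullet could look something like the sketch below, assuming each partition is already sorted by transaction id (as it would be with a single publisher). `merge_partitions` is a hypothetical helper, and the partitions are plain in-memory lists rather than live consumers.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (txn_id, payload)
Change = Tuple[int, str]


def merge_partitions(*partitions: Iterable[Change]) -> Iterator[Change]:
    """Merge per-partition streams, each sorted by txn_id, into one stream
    ordered by txn_id across all partitions."""
    return heapq.merge(*partitions, key=lambda change: change[0])


p0 = [(1, "insert users/u1"), (4, "update users/u1")]
p1 = [(2, "insert orders/o1"), (3, "insert users/u2"), (5, "delete orders/o1")]
merged = list(merge_partitions(p0, p1))
```

On a live stream a merging consumer would also have to wait until every partition has delivered something at or beyond the id it is about to emit, which this in-memory version does not model.
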
>>> 
>>> One thing to note is that the requirement of having a single consumer
>>> process/thread to get the total order isn't really so much a Kafka
>>> restriction as it just is a restriction about the world, since if you had
>>> multiple threads, even if you delivered messages to them in order, their
>>> processing might happen out of order (just due to the random timing of the
>>> processing).
>>> 
>>> -Jay
>>> 
>>> 
>>> 
>>> On Tue, Mar 3, 2015 at 3:15 PM, Josh Rader <jrader...@gmail.com> wrote:
>>> 
>>>> Hi Kafka Experts,
>>>> 
>>>> 
>>>> 
>>>> We have a use case around RDBMS replication where we are investigating
>>>> Kafka.  In this case ordering is very important.  Our understanding is
>>>> ordering is only preserved within a single partition.  This makes sense
>>>> as a single thread will consume these messages, but our question is can
>>>> we somehow parallelize this for better performance?  Is there maybe some
>>>> partition key strategy trick to have your cake and eat it too in terms
>>>> of keeping ordering, but also able to parallelize the processing?
>>>> 
>>>> 
>>>> 
>>>> I am sorry if this has already been asked, but we tried to search
>>>> through the archives and couldn’t find this response.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Josh
>>>> 
>> 
>> 
> 
> 
> --
> -- Guozhang
