> On Mar 3, 2015, at 4:18 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> In addition to Jay's recommendation, you also need to take special care
> with error handling in the producer in order to preserve ordering, since
> the producer uses batching and async sending. That is, if you have already
> handed messages 1,2,3,4,5 to the producer but are later notified that
> message 3 failed to send, you need to avoid sending messages 4,5 before 3
> gets fixed or dropped.
>
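
A minimal sketch of the constraint described above, in its most conservative form: send one message at a time and do not move on until it is acknowledged or explicitly dropped. The `send` callable and `max_retries` parameter are hypothetical stand-ins for illustration; this is not the Kafka producer API and it says nothing about how the real producer batches.

```python
from typing import Callable, Iterable, List, Tuple


def send_in_order(
    messages: Iterable[str],
    send: Callable[[str], bool],
    max_retries: int = 3,
) -> Tuple[List[str], List[str]]:
    """Send messages strictly one at a time, in order.

    `send` is a hypothetical synchronous call that returns True once the
    message is acknowledged and False on failure. Message n+1 is never
    attempted until message n has been acknowledged or dropped.
    Returns (acknowledged, dropped).
    """
    acknowledged: List[str] = []
    dropped: List[str] = []
    for message in messages:
        # One initial attempt plus up to `max_retries` retries.
        for _attempt in range(1 + max_retries):
            if send(message):
                acknowledged.append(message)
                break
        else:
            # Retries exhausted: drop the message before continuing.
            dropped.append(message)
    return acknowledged, dropped


if __name__ == "__main__":
    attempts: List[str] = []
    failures_left = {"3": 2}  # message 3 fails twice, then succeeds

    def flaky_send(message: str) -> bool:
        attempts.append(message)
        if failures_left.get(message, 0) > 0:
            failures_left[message] -= 1
            return False
        return True

    print(send_in_order(["1", "2", "3", "4", "5"], flaky_send))
    print(attempts)  # 4 and 5 are only attempted after 3 is resolved
```

Waiting for each acknowledgment keeps ordering trivially but gives up the throughput benefit of batching, which is the tradeoff the question below is about.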

Guozhang, how would we do this? Would this require sending each message individually and waiting for acknowledgment of each message?

Send 1
Wait for ack
Send 2
Wait for ack
etc

If I try to send 1,2,3,4,5 in a batch, is it possible that the broker could receive 1,2 and 4,5, and that only 3 would fail? Or is it always a contiguous chunk, and then the first failure would cause the rest of the batch to abort?

-James

> Guozhang
>
> On Tue, Mar 3, 2015 at 3:45 PM, Xiao <lixiao1...@gmail.com> wrote:
>
>> Hey Josh,
>>
>> Transactions can be applied in parallel on the consumer side based on
>> transaction dependency checking.
>>
>> http://www.google.com.ar/patents/US20080163222
>>
>> This patent documents how it works. It is easy to understand; however, you
>> also need to consider the hash collision issues. This has been implemented
>> in IBM Q Replication since 2001.
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>> On Mar 3, 2015, at 3:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>>
>>> Hey Josh,
>>>
>>> As you say, ordering is per partition. Technically it is generally possible
>>> to publish all changes to a database to a single partition--generally the
>>> kafka partition should be high throughput enough to keep up. However there
>>> are a couple of downsides to this:
>>> 1. Consumer parallelism is limited to one. If you want a total order to the
>>> consumption of messages you need to have just 1 process, but often you
>>> would want to parallelize.
>>> 2. Often what people want is not a full stream of all changes in all tables
>>> in a database but rather the changes to a particular table.
>>>
>>> To some extent the best way to do this depends on what you will do with the
>>> data. However if you intend to have lots
>>>
>>> I have seen pretty much every variation on this in the wild, and here is
>>> what I would recommend:
>>> 1. Have a single publisher process that publishes events into Kafka
>>> 2. If possible use the database log to get these changes (e.g.
>>> MySQL binlog, Oracle XStream, GoldenGate, etc). This will be more complete and
>>> more efficient than polling for changes, though that can work too.
>>> 3. Publish each table to its own topic.
>>> 4. Partition each topic by the primary key of the table.
>>> 5. Include in each message the database's transaction id, scn, or other
>>> identifier that gives the total order within the record stream. Since there
>>> is a single publisher this id will be monotonic within each partition.
>>>
>>> This seems to be the best set of tradeoffs for most use cases:
>>> - You can have parallel consumers up to the number of partitions you chose
>>> that still get messages in order per ID'd entity.
>>> - You can subscribe to just one table if you like, or to multiple tables.
>>> - Consumers who need a total order over all updates can do a "merge" across
>>> the partitions to reassemble the fully ordered set of changes across all
>>> tables/partitions.
>>>
>>> One thing to note is that the requirement of having a single consumer
>>> process/thread to get the total order isn't really so much a Kafka
>>> restriction as it just is a restriction about the world, since if you had
>>> multiple threads even if you delivered messages to them in order their
>>> processing might happen out of order (just due to the random timing of the
>>> processing).
>>>
>>> -Jay
>>>
>>>
>>> On Tue, Mar 3, 2015 at 3:15 PM, Josh Rader <jrader...@gmail.com> wrote:
>>>
>>>> Hi Kafka Experts,
>>>>
>>>> We have a use case around RDBMS replication where we are investigating
>>>> Kafka. In this case ordering is very important. Our understanding is
>>>> ordering is only preserved within a single partition. This makes sense as
>>>> a single thread will consume these messages, but our question is can we
>>>> somehow parallelize this for better performance?
>>>> Is there maybe some partition key strategy trick to have your cake and
>>>> eat it too in terms of keeping ordering, but also able to parallelize
>>>> the processing?
>>>>
>>>> I am sorry if this has already been asked, but we tried to search through
>>>> the archives and couldn’t find this response.
>>>>
>>>> Thanks,
>>>>
>>>> Josh
>>>>
>>
>>
>
>
> --
> -- Guozhang
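
For reference, the recommendations quoted above (partition by primary key, carry the database transaction id in each message, and merge across partitions when a total order is needed) can be sketched as a small in-memory simulation. Everything here is an assumption for illustration: `Change`, `partition_for`, `publish`, and `merge_total_order` are hypothetical names, plain lists stand in for Kafka partitions, and the CRC32 hash is just a stable stand-in, not Kafka's partitioner.

```python
import heapq
import zlib
from typing import Dict, Iterator, List, NamedTuple


class Change(NamedTuple):
    txn_id: int        # database transaction id / SCN, increasing at the publisher
    table: str
    primary_key: str
    payload: str


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a primary key to a partition with a stable hash (illustrative only)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


def publish(changes: List[Change], num_partitions: int) -> Dict[int, List[Change]]:
    """Simulate a single publisher appending changes, in txn order, to partitions."""
    partitions: Dict[int, List[Change]] = {p: [] for p in range(num_partitions)}
    for change in changes:
        partitions[partition_for(change.primary_key, num_partitions)].append(change)
    return partitions


def merge_total_order(partitions: Dict[int, List[Change]]) -> Iterator[Change]:
    """Reassemble the total order by merging the per-partition streams on txn_id.

    This relies on each partition already being sorted by txn_id, which holds
    here because there is a single publisher.
    """
    return heapq.merge(*partitions.values(), key=lambda change: change.txn_id)


if __name__ == "__main__":
    changes = [
        Change(i, "users", f"user-{i % 4}", f"update-{i}") for i in range(1, 11)
    ]
    partitions = publish(changes, num_partitions=3)
    for partition, records in partitions.items():
        print(partition, [record.txn_id for record in records])
    print([record.txn_id for record in merge_total_order(partitions)])
```

All changes for a given primary key land in the same partition, so per-entity ordering survives parallel consumption, and the merge recovers the global order only for consumers that need it. On a live stream the merge would also have to decide how long to wait for a quiet partition, which this sketch ignores.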