Thanks Guozhang, I was looking for actual real-world workflows. I realize you can commit after each message, but if you're using ZK for offsets, for instance, you'll put too much write load on the nodes and crush your throughput. So I was interested in batching strategies people have used that balance full throughput against fully committed events.
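For what it's worth, one common way to cut the ZK write load is to commit only every N processed messages or every few seconds, whichever comes first, accepting a bounded window of duplicates on restart. A minimal sketch of that idea; the class, `commit_fn` callback, and parameter names are mine, not any Kafka client API:

```python
import time


class BatchedOffsetCommitter:
    """Commit offsets every `batch_size` messages or every `interval_s`
    seconds, whichever comes first, instead of once per message.
    `commit_fn` is a stand-in for whatever offset-commit call you use
    (e.g. a ZK write); this is an illustrative sketch, not a real API."""

    def __init__(self, commit_fn, batch_size=1000, interval_s=5.0):
        self.commit_fn = commit_fn
        self.batch_size = batch_size
        self.interval_s = interval_s
        self.uncommitted = 0
        self.last_commit = time.monotonic()
        self.last_offset = None

    def record(self, offset):
        # Call after each message is fully processed, in offset order.
        self.last_offset = offset
        self.uncommitted += 1
        if (self.uncommitted >= self.batch_size
                or time.monotonic() - self.last_commit >= self.interval_s):
            self.flush()

    def flush(self):
        # One ZK write per batch instead of one per message.
        if self.uncommitted:
            self.commit_fn(self.last_offset)
            self.uncommitted = 0
            self.last_commit = time.monotonic()
```

With batch_size=100, processing 1000 messages produces 10 commits instead of 1000; on a crash you would reprocess at most one batch, which is fine for at-least-once.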
On Thu, Jul 31, 2014 at 8:16 AM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hi Jim,
>
> Whether to use the high level or simple consumer depends on your use case.
> If you need to manually manage partition assignments among your consumers,
> or you need to commit your offsets somewhere other than ZK, or you do not
> want auto rebalancing of consumers upon failures, etc., you would use the
> simple consumer; otherwise use the high level consumer.
>
> From your description of pulling a batch of messages, it seems you are
> currently using the simple consumer. Supposing you were using the high
> level consumer, to achieve at-least-once you can basically do something
> like:
>
> message = consumer.iter.next()
> process(message)
> consumer.commit()
>
> which is effectively the same as option 2 with a simple consumer. Of
> course, doing so has the heavy overhead of one commit per message; you can
> also do option 1, at the cost of duplicates, which is tolerable for
> at-least-once.
>
> Guozhang
>
>
> On Wed, Jul 30, 2014 at 8:25 PM, Jim <jimi...@gmail.com> wrote:
>
> > Curious on a couple questions...
> >
> > Are most people (are you?) using the simple consumer vs the high level
> > consumer in production?
> >
> > What is the common processing paradigm for maintaining a full pipeline
> > for kafka consumers for at-least-once messaging? E.g. you pull a batch
> > of 1000 messages and:
> >
> > option 1.
> > you wait for the slowest worker to finish working on its message; when
> > you get back 1000 acks internally, you commit your offset and pull
> > another batch
> >
> > option 2.
> > you feed your workers n msgs at a time in sequence and move your offset
> > up as you work through your batch
> >
> > option 3.
> > you maintain a full stream of 1000 messages ideally, and as you get acks
> > back from your workers you see if you can move your offset up in the
> > stream to pull n more messages to fill up your pipeline, so you're not
> > blocked by the slowest consumer (probability wise)
> >
> > any good docs or articles on the subject would be great, thanks!
>
> --
> -- Guozhang
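Option 3 in the quoted mail amounts to tracking out-of-order acks and only ever committing the end of the longest contiguous acked prefix, so a slow worker holds back the committed offset but not the pipeline. A minimal sketch of that bookkeeping, assuming monotonically increasing per-partition offsets; the class and method names are mine, not any Kafka API:

```python
import heapq


class AckWindow:
    """Track acks that arrive out of order and advance the committable
    offset past every contiguous acked prefix (option 3 in the thread).
    A sketch only: `committable()` would feed whatever offset-commit
    call you use, batched however you like."""

    def __init__(self, start_offset=0):
        self.next_expected = start_offset  # lowest offset not yet acked
        self.pending = []                  # min-heap of acked offsets ahead of it

    def ack(self, offset):
        heapq.heappush(self.pending, offset)
        # Pop off any acks that are now contiguous with the committed prefix.
        while self.pending and self.pending[0] == self.next_expected:
            heapq.heappop(self.pending)
            self.next_expected += 1

    def committable(self):
        # Safe offset to commit: every offset below it has been acked,
        # so a restart from here reprocesses nothing already acked.
        return self.next_expected
```

For example, acking offsets 2, 0, and 3 leaves the committable offset at 1 (offset 1 is still outstanding); acking 1 advances it to 4 in one step. The gap between `committable()` and the highest fetched offset is the in-flight window you can cap to bound memory and redelivery.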