Hi,

I have a web service that serves up data that it obtains from a Kafka
topic. When the process starts up, it loads the entire topic into memory
and serves the data from an in-memory hashtable. The data in the topic has
primary keys and is log compacted, so the total dataset will be small
enough to fit in memory. My web service will only start serving data once
the entire topic is loaded. (And for that,
https://issues.apache.org/jira/browse/KAFKA-1977 would be super useful.)

I am only storing this data in memory. In the event of process death or
restart, my in-memory state is gone, so I will always rebuild it by
consuming the topic again from the earliest offset. I will never need to
checkpoint my offsets.

Also, I will have N instances of this application, each one needing to consume 
the entire topic. This is how I plan to do horizontal scaling of my web service.

I would like to use the high-level consumer, so that I don't need to
manually discover which broker is the leader, and so that I don't have to
handle leader rebalancing.
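
To make this concrete, here is roughly what I have in mind for the loading
step. This is just a sketch against the 0.8.x high-level consumer: the
ZooKeeper address, topic name, and string keys are placeholders, and the
consumer.timeout.ms trick is only a crude stand-in for the "fully caught
up" signal that KAFKA-1977 would provide:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.ConsumerTimeoutException;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class TopicLoader {
    public static Map<String, byte[]> load() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");  // placeholder
        // Unique group id per process, so every instance gets all partitions.
        props.put("group.id", "my-webservice-" + UUID.randomUUID());
        // Always start from the beginning of the topic...
        props.put("auto.offset.reset", "smallest");
        // ...and never checkpoint offsets.
        props.put("auto.commit.enable", "false");
        // Crude "am I caught up?" heuristic: treat 5s of silence as
        // end-of-topic. This is the part KAFKA-1977 would make reliable.
        props.put("consumer.timeout.ms", "5000");

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("my-topic", 1);  // placeholder topic name
        ConsumerIterator<byte[], byte[]> it = connector
            .createMessageStreams(topicCountMap).get("my-topic").get(0).iterator();

        // The topic is compacted, so the last write per key wins and a
        // plain put() is enough to build the table.
        Map<String, byte[]> table = new HashMap<String, byte[]>();
        try {
            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> msg = it.next();
                table.put(new String(msg.key()), msg.message());
            }
        } catch (ConsumerTimeoutException e) {
            // No messages for 5s; assume the topic is fully loaded. (In the
            // real service I'd keep consuming to pick up later writes.)
        }
        return table;
    }
}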

A couple questions:
1) Does this use case make sense? Is this pattern used by anyone else? I like 
it because it makes my web service completely stateless.
2) In order to make each instance consume all partitions of the topic, I need 
each consumer group id to be unique to that process. So I was thinking of just 
using a UUID or something similar. What is the "cost" of creating a new 
consumer group id? If I am creating a new one every time I start my 
application, would I be cluttering up zookeeper or the __consumer_offsets 
topic? Note there will only every be N instances of my application running. 
Since I never will need to checkpoint my offsets, does that affect my question 
about "cluttering up" zookeeper/kafka? Are old consumer groups ever cleaned up 
out of zookeeper or the __consumer_offsets topic?
3) Are the stored offsets used for any other reason, aside from when a new
consumer starts up? Are offsets used after rebalancing, when partition
leaders change due to broker failure? I know that offsets can be used for
Burrow-like monitoring.
4) Since I don't need support for checkpointing, another option is to use
the SimpleConsumer. The sample code at
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
looks fairly comprehensive. It handles discovery of the partition leader,
and handles leader rebalancing. Are there any other situations that I
should be aware of before relying on that sample code? (I've pasted a
stripped-down version of its core fetch loop after these questions.)
5) Will any of this change when the new consumer comes out? Will the
SimpleConsumer still exist once the new consumer is released? (I've also
sketched below what I imagine my pattern might look like on the new
consumer.)
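
For question 4, here is roughly what the core fetch loop from that sample
reduces to, once you strip out the leader discovery and error handling.
This is only a sketch: one partition, the leader host hard-coded as a
placeholder, and the error path stubbed out, which is exactly the part the
wiki example spends most of its code on:

import java.nio.ByteBuffer;
import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class FetchLoopSketch {
    public static void main(String[] args) {
        // Placeholder host/port; the real sample discovers the leader first.
        SimpleConsumer consumer =
            new SimpleConsumer("broker1", 9092, 100000, 64 * 1024, "my-client");
        long offset = 0L;
        while (true) {
            FetchRequest req = new FetchRequestBuilder()
                .clientId("my-client")
                .addFetch("my-topic", 0, offset, 100000)
                .build();
            FetchResponse resp = consumer.fetch(req);
            if (resp.hasError()) {
                // The wiki sample's real value is here: handling leader
                // changes, offset-out-of-range errors, retries, etc.
                break;
            }
            for (MessageAndOffset mo : resp.messageSet("my-topic", 0)) {
                ByteBuffer payload = mo.message().payload();
                byte[] value = new byte[payload.limit()];
                payload.get(value);
                // ...update the in-memory table...
                offset = mo.nextOffset();
            }
        }
        consumer.close();
    }
}

And for question 5, this part is purely speculative on my part: if the new
consumer ends up supporting manual partition assignment, I imagine the
whole pattern could drop group ids and ZooKeeper entirely. All class and
config names below are guesses based on my reading of the new consumer
proposals:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class NewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder
        props.put("enable.auto.commit", "false");
        // With no committed offsets, this starts from the beginning.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        KafkaConsumer<byte[], byte[]> consumer =
            new KafkaConsumer<byte[], byte[]>(props);
        // Manual assignment: no consumer group coordination at all, so every
        // instance can read every partition without inventing group ids.
        consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)));
        while (true) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
            for (ConsumerRecord<byte[], byte[]> r : records) {
                // ...update the in-memory table from r.key() / r.value()...
            }
        }
    }
}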

Thanks,
-James
