Hi,

We've been experimenting a little with running Kafka internally to better handle temporary throughput peaks of asynchronous tasks. However, we've had big issues making Kafka work for us, and I am starting to question whether it's a good fit.
Our use case:
- High latency: at peaks, each consumer requires ~20 seconds to handle a single message/task.
- Extreme variation in message size: serialized tasks range from ~300 bytes up to ~3 MB.
- Processing generally takes ~20 seconds independent of message size. Most messages are small.

Our setup:
- Kafka 0.9.0.
- Using the new Java consumer API (consumer groups etc.).
- To occasionally handle large (3 MB) messages, we've had to set the following configuration parameters:
  - max.partition.fetch.bytes=10485760 (10 MB) on the consumer, to fetch larger messages.
  - session.timeout.ms=30000 (30 s) on the consumer, to accommodate our high-latency processing.
  - replica.fetch.max.bytes=10485760 (10 MB) on the broker.
  - message.max.bytes=10485760 (10 MB) on the broker.

Sample code:

    while (isRunning()) {
        ConsumerRecords<String, byte[]> records = consumer.poll(100);
        for (final ConsumerRecord<String, byte[]> record : records) {
            // Handle record...
        }
    }

AFAIK this is very basic consumer code.

Initial problem: When doing load testing to simulate peaks, our consumer started spinning infinitely, in similar fashion to [1]. We also noticed that we were consistently seeing [2] in our broker log.

[1] http://bit.ly/1Q7zxgh
[2] https://gist.github.com/JensRantil/2d1e7db3bd919eb35f9b

Root cause analysis: AFAIK, heartbeats are only sent to Kafka when calling `consumer.poll(...)`. To handle larger messages, we needed to increase max.partition.fetch.bytes. However, due to our high-latency processing, a large number of small messages could be prefetched in a single poll, which made our inner for loop run long enough for the broker to consider our consumer dead.

Two questions:
- Is there any workaround to avoid the broker thinking our consumer is dead? Increasing the session timeout to cover the whole polling interval for small messages is not an option, since we simply prefetch too many messages for that to work. Can we set a limit on how many messages Kafka prefetches?
  Or is there a way to send heartbeats to the broker out of band, without invoking the `KafkaConsumer#poll` method?
- Is Kafka a bad tool for our use case?

Thanks and have a nice weekend,
Jens

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook <https://www.facebook.com/#!/tink.se> Linkedin <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary> Twitter <https://twitter.com/tink>
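P.S. One possible direction for the workaround question above: since Kafka 0.9 has no setting to cap records per poll, the processing loop itself could bound how many prefetched records it handles before yielding back to poll. Between slices, one could pause all assigned partitions and call poll(0), so a heartbeat is sent without fetching further records (I'm assuming here that the 0.9 KafkaConsumer's pause/resume methods behave that way). Below is a self-contained sketch of just the slicing logic; a plain queue stands in for the prefetched batch, and names like processSlice are hypothetical, not part of the Kafka API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class BoundedBatchSketch {

    // Process at most maxPerSlice records, then return so the caller can
    // heartbeat before the session timeout expires. In the real consumer
    // loop, the heartbeat between slices would be (assumption, 0.9 API):
    //   consumer.pause(/* all assigned partitions */);
    //   consumer.poll(0);   // sends heartbeat, fetches nothing while paused
    //   consumer.resume(/* all assigned partitions */);
    static int processSlice(Queue<byte[]> prefetched, int maxPerSlice,
                            List<Integer> processedSizes) {
        int handled = 0;
        while (handled < maxPerSlice && !prefetched.isEmpty()) {
            byte[] record = prefetched.poll();
            processedSizes.add(record.length); // stand-in for the ~20 s of real work
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        // Simulate a prefetched batch of ten small (~300 byte) messages.
        Queue<byte[]> prefetched = new ArrayDeque<>();
        for (int i = 0; i < 10; i++) prefetched.add(new byte[300]);

        List<Integer> processed = new ArrayList<>();
        while (!prefetched.isEmpty()) {
            processSlice(prefetched, 3, processed);
            // heartbeat would go here, between slices
        }
        System.out.println(processed.size()); // 10
    }
}
```

With each slice kept well under session.timeout.ms, the gap between polls stays bounded regardless of how many small messages one fetch prefetched.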