Hi, I have a use case that might be relevant to the new consumer API, although most of the work would need to happen on the broker side. I would be surprised if this had not been discussed before, but I was not able to find any directly related thread. Has there been any discussion about broker-side filtering of message streams at consume time?
Let me elaborate a bit. As a concrete example, we collect JMX metrics from our application servers and push them into a single Kafka topic, "jmx". We have several application clusters, each with several host machines, each running several application processes from which these metrics are collected and pushed. Our message format is something like:

    datetime, cluster, host, process, metric name, metric value

for example:

    20140313235959999, appcluster001, host01, process001, jvm.ThreadCount, 100

We currently use plain CSV here but have some vague plans to migrate to Avro in the future. We make use of both the batching and compression features.

Some consumers continuously retrieve the full set of JMX metrics, while others need to connect ad hoc and retrieve only a subset of the "jmx" topic. That subset could be the full retention window of one particular metric, to compare hosts against each other; it could be the full set of metrics for one particular cluster; or some other variation. Today these ad-hoc consumers have to retrieve the entire topic only to discard what they are not interested in (see the first sketch at the end of this message), which is tedious and wasteful, especially when bandwidth is limited and the consumer is not located in the same datacenter as the Kafka cluster.

One idea that works with the current broker implementation would be to develop some kind of proxy consumer, co-located with the Kafka cluster, that performs the filtering, but that would complicate things much more than having this feature available in the consumer API directly. Partitioning the topic would also be a solution, but in the above example we would need on the order of a million partitions to account for all combinations, and I suspect that this number of partitions would hurt performance or cause other problems. Currently this topic has 12 partitions, solely to scale the number of consumers.

I'm imagining a consumer API where I can subscribe to a topic and provide a regex as a filter for the data I'm interested in, and the broker would send back only the messages matching that filter (see the second sketch below). Even better would be if the consumer could supply a custom function to apply to the response data.

Would this be a possible/feasible feature to have in the broker? What about compressed topics, or otherwise "encoded" messages? Is this use case very narrow, or would other people have similar interests?

I would be interested in hearing your thoughts or any suggestions you might have.
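For reference, here is roughly what one of our ad-hoc consumers looks like today against the 0.8 high-level consumer API (simplified; the ZooKeeper address, group id, and the contains() filter are placeholders for illustration):

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class AdHocJmxConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk001:2181");
            props.put("group.id", "adhoc-threadcount-report");
            props.put("auto.offset.reset", "smallest"); // replay full retention

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("jmx", 1));
            ConsumerIterator<byte[], byte[]> it = streams.get("jmx").get(0).iterator();

            while (it.hasNext()) {
                // Every message in the topic crosses the wire, even though we
                // keep only one metric and throw the rest away client-side.
                String record = new String(it.next().message(), StandardCharsets.UTF_8);
                if (record.contains("jvm.ThreadCount")) {
                    System.out.println(record);
                }
            }
        }
    }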
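And this is the kind of thing I'm imagining instead. To be clear, it is entirely hypothetical: neither the interface nor the method below exists in Kafka, and the names are made up purely to illustrate the idea.

    import java.util.List;
    import kafka.consumer.KafkaStream;

    // Hypothetical sketch only -- nothing like this exists in Kafka today.
    public interface FilteringConsumerConnector {

        /**
         * Like createMessageStreams(), but the regex travels with the fetch
         * request and is applied on the broker, so messages that don't match
         * never leave the broker's network.
         */
        List<KafkaStream<byte[], byte[]>> createFilteredStreams(
                String topic, int numStreams, String regexFilter);
    }

With our CSV format, pulling one metric across all clusters would then be something like createFilteredStreams("jmx", 1, "[^,]*, [^,]*, [^,]*, [^,]*, jvm\\.ThreadCount, .*"), and the broker would drop everything else before it hits the network. The custom-function variant would presumably take some predicate class instead of a regex, though I realize shipping user code to the broker raises its own questions.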