Hi,

I have a use case that might be relevant to the new consumer API but that
would require most of the work on the broker. I would be surprised if it
had not been discussed before but I was not able to find any directly
related thread. Has there been any discussion about providing broker side
consume-request filtering of message streams?

Let me elaborate with a concrete example: we collect JMX metrics from
our application servers and push them into a single Kafka topic "jmx". We
have several application clusters, each with several host machines, each
running several application processes from which these metrics are
collected and pushed. Our message format is something like:

Datetime, cluster, host, process, metric name, metric value

with an example:

20140313235959999, appcluster001, host01, process001, jvm.ThreadCount, 100

We currently use plain CSVs here but have some vague plans on migrating to
Avro in the future. We make use of both the batching and compression
features.
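
To make the record layout concrete, here is a minimal Python sketch of parsing one of these CSV lines (the field names come from the format above; the parser itself is illustrative, not our actual code):

```python
from collections import namedtuple

# Field layout matches the format described above:
# datetime, cluster, host, process, metric name, metric value
JmxRecord = namedtuple(
    "JmxRecord",
    ["datetime", "cluster", "host", "process", "metric", "value"],
)

def parse_record(line):
    """Split one CSV line from the 'jmx' topic into a JmxRecord."""
    dt, cluster, host, process, metric, value = line.strip().split(",")
    return JmxRecord(dt, cluster, host, process, metric, float(value))

record = parse_record(
    "20140313235959999,appcluster001,host01,process001,jvm.ThreadCount,100"
)
```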

Some consumers continuously retrieve the full set of JMX metrics and other
consumers have the need to connect ad-hoc and retrieve a subset of the JMX
topic. This could be the full retention set of one particular metric to
compare hosts against each other; it could be the full set of metrics for
one particular cluster, or some other variation. These ad-hoc consumers
currently have to retrieve the full data set of the topic and discard
everything they are not interested in, which is tedious, especially when
bandwidth is limited and the consumer is not located in the same datacenter
as the Kafka cluster.
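
To illustrate what an ad-hoc consumer is forced to do today, here is a hedged sketch (consume_full_topic is a stand-in for a real consumer loop over a couple of sample records; the point is that the filter runs on the client, after every byte has already crossed the wire):

```python
def consume_full_topic():
    """Stand-in for a real Kafka consumer: yields every record in the topic."""
    yield "20140313235959999,appcluster001,host01,process001,jvm.ThreadCount,100"
    yield "20140313235959999,appcluster002,host07,process003,jvm.HeapUsed,512"

def adhoc_consume(metric_of_interest):
    """Client-side filtering: the full topic is transferred, most is discarded."""
    for line in consume_full_topic():
        fields = line.split(",")
        if fields[4] == metric_of_interest:  # metric name is the 5th field
            yield line

matches = list(adhoc_consume("jvm.ThreadCount"))
# Only the matching record survives; the rest was fetched for nothing.
```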

One idea that works with the current broker implementation would be to
develop some kind of proxy consumer, co-located with the Kafka cluster,
that performs the filtering, but that would complicate things far more than
having this feature available in the consumer API directly.

Partitioning the topic would also be a solution, but in the above example
we would need on the order of a million partitions to cover all
combinations, and I suspect that that number of partitions would hurt
performance or cause other problems.
Currently this topic has 12 partitions, solely to scale the number of
consumers.
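
As a rough back-of-the-envelope check (the per-dimension counts below are made-up round numbers, not our real topology), the combination count grows multiplicatively:

```python
# Hypothetical round numbers per dimension -- not our actual deployment.
clusters = 10
hosts_per_cluster = 20
processes_per_host = 10
metrics_per_process = 500

# One partition per (cluster, host, process, metric) combination:
combinations = clusters * hosts_per_cluster * processes_per_host * metrics_per_process
# 10 * 20 * 10 * 500 = 1,000,000 partitions
```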

I'm imagining a consumer API where I can subscribe to a topic and provide a
regex as a filter for the data I'm interested in, and the broker would only
send back messages matching this filter. Even better would be if the
consumer could supply a custom function to apply to the response data.
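
Sketched as a hypothetical Python-flavoured API (nothing like this exists in any Kafka client today; this is only the shape I have in mind, with the broker-side evaluation simulated locally against two sample records):

```python
import re

class FilteredConsumer:
    """Hypothetical consumer that ships a filter to the broker.

    In the imagined design the regex would be evaluated broker-side, so
    only matching messages cross the wire. Here it is simulated locally.
    """

    SAMPLE_TOPIC = [
        "20140313235959999,appcluster001,host01,process001,jvm.ThreadCount,100",
        "20140313235959999,appcluster002,host07,process003,jvm.HeapUsed,512",
    ]

    def __init__(self, topic, message_filter):
        self.topic = topic
        self.pattern = re.compile(message_filter)

    def poll(self):
        # Broker-side in the imagined design; client-side in this simulation.
        return [m for m in self.SAMPLE_TOPIC if self.pattern.search(m)]

consumer = FilteredConsumer("jmx", r",appcluster001,")
records = consumer.poll()
```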

Would this be a possible/feasible feature to have in the broker? What about
compressed topics or otherwise "encoded" messages? Is this use case very
narrow or would other people have similar interests? I would be interested
in hearing your thoughts or any suggestions you might have.