Hello!

We are working on a Flink Streaming job that reads data from multiple Kafka
topics and writes it to DFS. We are using StreamingFileSink with a custom
implementation for the GCS file system, and it generates a lot of files
because the streams are partitioned among many parallel subtasks. Ideally we
would have at most one file per Kafka topic per interval. We also have some
heavy topics and some pretty light ones, so the solution should be smart
enough to utilize resources efficiently.

I was thinking we could partition based on how much data was ingested in the
last minute or so, to make sure messages from the same topic are routed to
the same file (or to a minimal number of files) whenever there are enough
resources to do so. Think bin packing.
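To make the bin-packing idea concrete, here is a minimal standalone sketch (plain Java, no Flink API; the class and method names are mine, not anything built in): given observed bytes-per-topic over the last window and a fixed number of writer subtasks, assign topics greedily, largest first, always to the currently least-loaded subtask.

```java
import java.util.*;

/**
 * Hypothetical greedy bin-packing of Kafka topics onto writer subtasks.
 * Input: observed volume per topic (e.g. bytes in the last minute) and the
 * sink parallelism. Output: topic -> subtask index, so each topic's records
 * end up in as few files as possible while spreading load.
 */
public class TopicBinPacker {
    public static Map<String, Integer> assign(Map<String, Long> bytesPerTopic,
                                              int numSubtasks) {
        // Sort topics by volume, descending, so heavy topics are placed first.
        List<Map.Entry<String, Long>> topics =
                new ArrayList<>(bytesPerTopic.entrySet());
        topics.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        long[] load = new long[numSubtasks];
        Map<String, Integer> assignment = new HashMap<>();
        for (Map.Entry<String, Long> topic : topics) {
            // Pick the least-loaded subtask for this topic.
            int best = 0;
            for (int i = 1; i < numSubtasks; i++) {
                if (load[i] < load[best]) best = i;
            }
            assignment.put(topic.getKey(), best);
            load[best] += topic.getValue();
        }
        return assignment;
    }
}
```

Each topic maps to exactly one subtask (so at most one open file per topic per interval), and heavy topics get spread across subtasks before the light ones fill in the gaps. This is the classic longest-processing-time heuristic, which is usually good enough for load balancing even though it is not optimal bin packing.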

Is it a good idea? Is there a built-in way to achieve it? If not, is there
a way to push state into the partitioner (or even into the Kafka client, to
repartition at the source)? I was thinking I could run a side stream that
calculates data volumes and then broadcasts them into the main stream, so
the partitioner can make a decision, but it feels a bit complex.
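For the broadcast route, the actual routing decision inside something like a BroadcastProcessFunction could stay very small. A sketch of just that decision (plain Java; the class, method, and fallback are illustrative assumptions, not Flink API): the broadcast side keeps the latest topic-to-subtask assignment up to date, and the main stream looks its topic up in it, falling back to a plain hash for topics the side stream has not measured yet.

```java
import java.util.Map;

/**
 * Hypothetical routing step for the main stream: the broadcast state holds
 * the latest topic -> subtask assignment (e.g. produced by bin packing over
 * recent volumes); records of unmeasured topics fall back to hashing.
 */
public class VolumeAwareRouter {
    public static int route(String topic,
                            Map<String, Integer> assignment,
                            int numSubtasks) {
        Integer target = assignment.get(topic);
        if (target != null) {
            return target;
        }
        // Fallback for topics with no volume data yet: stable hash routing.
        return Math.floorMod(topic.hashCode(), numSubtasks);
    }
}
```

The complexity you mention mostly lives in wiring this up (the side stream, the broadcast state, and turning a target subtask into a key Flink actually routes there), not in the decision itself, so it may be less work than it first appears.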

Another way is to modify the Kafka client to track messages per topic and
make the decision at that layer.

Am I on the right path?

Thank you
