Hello! We are working on a Flink Streaming job that reads data from multiple Kafka topics and writes it to DFS. We are using StreamingFileSink with a custom implementation for the GCS filesystem, and it generates a lot of files because the streams are partitioned among many parallel subtasks. Ideally we should have at most one file per Kafka topic per interval. We also have some heavy topics and some pretty light ones, so the solution should be smart enough to utilize resources efficiently.
I was thinking we could partition based on how much data was ingested in the last minute or so, to make sure messages from the same topic are routed to the same file (or a minimal number of files) when there are enough resources to do so. Think bin packing. Is this a good idea? Is there a built-in way to achieve it? If not, is there a way to push state into the partitioner (or even into the Kafka client, to repartition at the source)? I was thinking I could run a side stream that calculates data volumes and then broadcast it into the main stream, so the partitioner can make a decision, but that feels a bit complex. Another way would be to modify the Kafka client to track message volumes per topic and make the decision at that layer. Am I on the right path? Thank you!
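For reference, the decision the partitioner would make with those broadcast volume stats could look something like the greedy bin-packing sketch below. This is plain Java with no Flink dependencies, and all names (`TopicBinPacker`, the topic names, the subtask count) are hypothetical illustrations, not anything from our job; in a real job the resulting topic-to-subtask map would be consumed by a custom `Partitioner` (e.g. via `partitionCustom`) and refreshed as new stats arrive.

```java
import java.util.*;

// Hypothetical sketch: greedily pack topics onto writer subtasks by
// recently observed volume, so each topic maps to exactly one writer
// (hence one open file), heavy topics get spread across subtasks, and
// light topics share a subtask instead of wasting one each.
public class TopicBinPacker {

    // bytesPerTopic: volume observed over the last window (e.g. one minute).
    // Returns topic -> subtask index.
    public static Map<String, Integer> assign(Map<String, Long> bytesPerTopic,
                                              int numSubtasks) {
        // Sort topics heaviest-first ("best fit decreasing" style greedy).
        List<Map.Entry<String, Long>> sorted =
                new ArrayList<>(bytesPerTopic.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        long[] load = new long[numSubtasks];
        Map<String, Integer> assignment = new HashMap<>();
        for (Map.Entry<String, Long> e : sorted) {
            // Place each topic on the currently least-loaded subtask.
            int best = 0;
            for (int i = 1; i < numSubtasks; i++) {
                if (load[i] < load[best]) {
                    best = i;
                }
            }
            load[best] += e.getValue();
            assignment.put(e.getKey(), best);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up example volumes: one heavy topic, one medium, two light.
        Map<String, Long> volumes = new HashMap<>();
        volumes.put("heavy-topic", 10_000L);
        volumes.put("medium-topic", 4_000L);
        volumes.put("light-a", 100L);
        volumes.put("light-b", 50L);
        // With 2 subtasks the heavy topic gets one to itself and the rest share.
        System.out.println(assign(volumes, 2));
    }
}
```

One caveat with any volume-driven reassignment: every time the map changes, records for a topic start flowing to a different subtask, which rolls the old file and opens a new one, so the stats window should be long enough (and the assignment sticky enough) that repacking stays rare.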