Hi Marek,

What you are describing is a known problem in Flink. There are some thoughts on how to address it in FLINK-11499 (https://issues.apache.org/jira/browse/FLINK-11499) and FLINK-17505 (https://issues.apache.org/jira/browse/FLINK-17505). Maybe some of the ideas there already help with your current problem (for example, using long checkpoint intervals).
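To illustrate the long-checkpoint-interval idea: with bulk formats like Parquet, StreamingFileSink can only roll part files on checkpoints, so the checkpoint interval effectively determines how many files you get per bucket. A minimal sketch, assuming the DataStream API; the ten-minute interval is just an example value:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LongCheckpointIntervalJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bulk-encoded part files are rolled on every checkpoint, so a longer
        // interval directly means fewer, larger files per bucket.
        env.enableCheckpointing(10 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // ... Kafka source -> transformations -> StreamingFileSink go here ...

        env.execute("streaming-file-sink-with-long-checkpoint-interval");
    }
}

The trade-off is end-to-end latency: the data only becomes visible to Hive/Presto once a checkpoint completes and the in-progress files are committed.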
A related idea to (2) is to write your data with the Avro format, and then regularly use a batch job to transform your data from Avro to Parquet. I hope these are some helpful pointers. I don't have a good overview of other potential open source solutions.

Best,
Robert

On Thu, Sep 10, 2020 at 5:10 PM Marek Maj <marekm...@gmail.com> wrote:

> Hello Flink Community,
>
> When designing our data pipelines, we very often encounter the requirement
> to stream traffic (usually from Kafka) to an external distributed file
> system (usually HDFS or S3). This data is typically meant to be queried
> from Hive/Presto or similar tools. Preferably the data sits in a columnar
> format like Parquet.
>
> Currently, using Flink, it is possible to leverage StreamingFileSink to
> achieve what we want to some extent. It satisfies our requirements to
> partition data by event time and to ensure exactly-once semantics and
> fault tolerance with checkpointing. Unfortunately, when using a bulk
> writer like ParquetWriter, that comes at the price of producing a large
> number of files, which degrades the performance of queries.
>
> I believe that many companies struggle with similar use cases, and I know
> that some of them have already approached this problem. Solutions like
> Alibaba Hologres or the Netflix solution with Iceberg described during
> Flink Forward 2019 have emerged. Given that a full transition to a
> real-time data warehouse may take a significant amount of time and effort,
> I would like to primarily focus on solutions for tools like Hive/Presto
> backed by a distributed file system. Usually those are the systems we are
> integrating with.
>
> So what options do we have? Maybe I missed some existing open source tool?
>
> Currently, I can come up with two approaches using Flink exclusively:
> 1. Cache incoming traffic in Flink state until a trigger fires according
> to the rolling strategy, probably with some special handling for late
> events, and then output the data with StreamingFileSink. The solution is
> not perfect, as it may introduce additional latency, and queries will
> still be less performant compared to fully compacted files (the late
> events problem). The biggest issue I am afraid of is actually the
> performance drop while releasing data from Flink state, and its peaky
> character.
> 2. Focus on implementing a batch rewrite job that compacts data offline.
> The source for the job could be either Kafka or the small files produced
> by another job that uses a plain StreamingFileSink. The drawback is that
> the whole system gets more complex, additional maintenance is needed, and,
> maybe more troublingly, we enter the batch world again (how would we know
> that no more late data will come and that we can safely run the job?).
>
> I would really love to hear the community's thoughts on this.
>
> Kind regards
> Marek
>
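For reference, a rough sketch of the Avro-first variant suggested in the reply above: let StreamingFileSink write Avro files into time buckets, and let a periodic batch job rewrite closed buckets into large Parquet files for Hive/Presto. It assumes the AvroWriters factory from the flink-avro module and the Flink 1.11-era StreamingFileSink API; the hourly DateTimeBucketAssigner pattern is just an example choice:

import org.apache.avro.specific.SpecificRecordBase;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class AvroSinkSketch {

    // Builds a sink that writes Avro part files per hourly bucket. A separate
    // batch job can later pick up completed buckets and compact them to Parquet.
    public static <T extends SpecificRecordBase> StreamingFileSink<T> buildAvroSink(
            String outputPath, Class<T> recordClass) {
        return StreamingFileSink
                .forBulkFormat(new Path(outputPath), AvroWriters.forSpecificRecord(recordClass))
                // one bucket per hour gives the compaction job a natural unit of work
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH"))
                .build();
    }
}

The compaction step itself does not have to live in Flink; a scheduled batch job (or even a Hive INSERT OVERWRITE) that rewrites one closed bucket at a time keeps the late-data question local to how long you wait before declaring a bucket complete.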