Hi Marek,

What you are describing is a known problem in Flink. There are some
thoughts on how to address it in
https://issues.apache.org/jira/browse/FLINK-11499 and
https://issues.apache.org/jira/browse/FLINK-17505
Maybe some of the ideas there already help with your current problem (for
example, using longer checkpoint intervals: bulk writers roll a new part
file on every checkpoint, so a longer interval directly means fewer,
larger files).
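
A minimal sketch of that (the 15-minute interval is just an illustrative
value, not a tuned recommendation):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // StreamingFileSink with a bulk format (e.g. Parquet) rolls part files on
    // every checkpoint, so a longer interval means fewer, larger files -- at
    // the cost of higher end-to-end latency and more data to replay on failure.
    env.enableCheckpointing(15 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);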

An idea related to your option (2) is to write your data in the Avro format
and then regularly run a batch job that rewrites the Avro files as Parquet.
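
The streaming part of that could look roughly like the sketch below
(untested; stream, the Avro-generated MyRecord class, the target path and
the bucket format are all placeholders):

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.avro.AvroWriters;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

    // Streaming job: continuously write row-oriented Avro files into date buckets.
    // Note: DateTimeBucketAssigner buckets by processing time; a custom
    // BucketAssigner would be needed to bucket strictly by event time.
    StreamingFileSink<MyRecord> avroSink = StreamingFileSink
        .forBulkFormat(new Path("hdfs:///warehouse/events/avro"),
                       AvroWriters.forSpecificRecord(MyRecord.class))
        .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd"))
        .build();
    stream.addSink(avroSink);

A separately scheduled batch job can then pick up the Avro files of a
finished partition and rewrite them as a small number of large Parquet
files that Hive/Presto can query efficiently.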

I hope these are some helpful pointers. I don't have a good overview of
other potential open source solutions.

Best,
Robert


On Thu, Sep 10, 2020 at 5:10 PM Marek Maj <marekm...@gmail.com> wrote:

> Hello Flink Community,
>
> When designing our data pipelines, we very often encounter the requirement
> to stream traffic (usually from Kafka) to an external distributed file
> system (usually HDFS or S3). This data is typically meant to be queried
> with Hive/Presto or similar tools. Preferably, the data sits in a columnar
> format like Parquet.
>
> Currently, using Flink, it is possible to leverage StreamingFileSink to
> achieve what we want to some extent. It satisfies our requirements to
> partition data by event time and to ensure exactly-once semantics and
> fault-tolerance with checkpointing. Unfortunately, when using a bulk
> writer like ParquetWriter, this comes at the price of producing a large
> number of small files, which degrades query performance.
>
> I believe that many companies struggle with similar use cases. I know that
> some of them have already approached that problem. Solutions like Alibaba's
> Hologres or Netflix's solution with Iceberg described during Flink Forward
> 2019 emerged. Given that a full transition to a real-time data warehouse
> may take a significant amount of time and effort, I would like to focus
> primarily on solutions for tools like Hive/Presto backed by a distributed
> file system. Usually those are the systems we are integrating with.
>
> So what options do we have? Maybe I missed some existing open source tool?
>
> Currently, I can come up with two approaches using Flink exclusively:
> 1. Cache incoming traffic in Flink state until a trigger fires according
> to a rolling strategy, probably with some special handling for late
> events, and then output the data with StreamingFileSink. The solution is
> not perfect, as it may introduce additional latency, and queries will
> still be less performant compared to fully compacted files (the late
> events problem). The biggest issue I am afraid of is actually the
> performance drop while releasing data from Flink state, and the bursty
> nature of that release.
> 2. Focus on implementing a batch rewrite job that compacts data offline.
> The source for the job could be either Kafka or the small files produced
> by another job that uses plain StreamingFileSink. The drawback is that
> the whole system gets more complex, additional maintenance is needed and,
> maybe more troubling, we enter the batch world again (how can we know
> that no more late data will arrive and that we can safely run the job?).
>
> I would really love to hear the community's thoughts on that.
>
> Kind regards
> Marek
>
