Arvid, Jan and Ayush,
Thanks for the ideas! -Kumar
From: Jan Lukavský
Date: Monday, September 14, 2020 at 6:23 AM
To: "user@flink.apache.org"
Subject: Re: Streaming data to parquet
Hi,
I'd like to mention another approach, which might not be as "flinkish", but rem
appreciate any ideas etc.
Cheers
Kumar
From: Ayush Verma
Date: Friday, September 11, 2020 at 8:14 AM
To: Robert Metzger
Cc: Marek Maj, user
Subject: Re: Streaming data to parquet
Hi,
Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which system you use, this variable will always be there. If you commit frequently, you will be close to realtime, but you will have numerous small files. If you commit after long intervals, you will have larger files, but you will be further away from realtime.
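As a minimal sketch of how this tradeoff surfaces with Flink's StreamingFileSink (the Event POJO, the S3 path, and the 10-minute interval are illustrative assumptions, not something from this thread): bulk formats such as parquet roll a new file on every checkpoint, so the checkpoint interval is effectively the commit interval.

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {

    // Hypothetical POJO standing in for the real event type.
    public static class Event {
        public String id;
        public long timestamp;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bulk formats roll on every checkpoint, so this interval decides
        // both freshness and file size: 10 minutes means larger files, but
        // data that is up to 10 minutes stale.
        env.enableCheckpointing(10 * 60 * 1000L);

        StreamingFileSink<Event> sink = StreamingFileSink
                .forBulkFormat(new Path("s3://bucket/events"),
                        ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

        env.fromElements(new Event()) // stand-in for a Kafka source
                .addSink(sink);

        env.execute("events-to-parquet");
    }
}

Note that forBulkFormat only supports rolling on checkpoint, which is why the checkpoint interval is the main knob here.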
Hi Marek,
what you are describing is a known problem in Flink. There are some
thoughts on how to address this in
https://issues.apache.org/jira/browse/FLINK-11499 and
https://issues.apache.org/jira/browse/FLINK-17505
Maybe some ideas there help you already for your current problem (use long checkpoint intervals).
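As a rough sketch of what "long checkpoint intervals" could look like in code (the durations below are placeholders, not recommendations from this thread):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fewer checkpoints mean fewer commits and therefore fewer, larger
        // files, at the cost of freshness. 15 minutes is a placeholder.
        env.enableCheckpointing(15 * 60 * 1000L);

        // Avoid back-to-back checkpoints when one runs long.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);

        // Give slow commits (e.g. S3 multipart uploads) time to finish.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);
    }
}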
Hello Flink Community,
When designing our data pipelines, we very often encounter the requirement to stream traffic (usually from Kafka) to an external distributed file system (usually HDFS or S3). This data is typically meant to be queried from Hive/Presto or similar tools. Preferably data sits in columnar format such as parquet.
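For context, a minimal sketch of the ingestion side of such a pipeline (broker address, group id, and topic name are placeholder assumptions; the stream would then feed a parquet-writing sink as discussed above in the thread):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaIngestSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "parquet-writer");      // placeholder

        // Consume raw records from Kafka; the topic name is a placeholder.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        // A real pipeline would hand this stream to a StreamingFileSink
        // writing parquet to HDFS or S3; printing is just for the sketch.
        events.print();

        env.execute("kafka-ingest");
    }
}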