Re: Streaming data to parquet

2020-09-14 Thread Senthil Kumar
Arvid, Jan and Ayush, Thanks for the ideas! -Kumar From: Jan Lukavský Date: Monday, September 14, 2020 at 6:23 AM To: "user@flink.apache.org" Subject: Re: Streaming data to parquet Hi, I'd like to mention another approach, which might not be as "flinkish", but rem

Re: Streaming data to parquet

2020-09-14 Thread Jan Lukavský
pache.org> Cc: Marek Maj <marekm...@gmail.com>, user <user@flink.apache.org> Subject: Re: Streaming data to parquet Hi, Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which system you use, this

Re: Streaming data to parquet

2020-09-14 Thread Arvid Heise
From: Ayush Verma Date: Friday, September 11, 2020 at 8:14 AM To: Robert Metzger Cc: Marek Maj, user Subject: Re: Streaming data to parquet Hi, Looking at the problem broadly, file size is directly tied up with

Re: Streaming data to parquet

2020-09-11 Thread Senthil Kumar
appreciate any ideas etc. Cheers Kumar From: Ayush Verma Date: Friday, September 11, 2020 at 8:14 AM To: Robert Metzger Cc: Marek Maj, user Subject: Re: Streaming data to parquet Hi, Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which

Re: Streaming data to parquet

2020-09-11 Thread Ayush Verma
Hi, Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which system you use, this variable will always be there. If you commit frequently, you will be close to realtime, but you will have numerous small files. If you commit after long intervals, you
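To put rough numbers on the link between commit interval and file size described above, here is a minimal back-of-the-envelope sketch. It is not from the thread: the function name, the parameters, and the example figures are all hypothetical, and it assumes a steady input rate, one file per parallel writer per commit, and no compression or partition bucketing.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_files(throughput_mb_per_s, commit_interval_s, parallelism=1):
    """Rough estimate of file size and daily file count for a streaming sink.

    Assumptions (hypothetical, for illustration only):
    - input arrives at a constant rate and is spread evenly over the writers
    - every writer closes exactly one file at each commit
    - sizes are uncompressed; real parquet files would be smaller
    """
    # Data accumulated by a single writer between two commits.
    file_size_mb = throughput_mb_per_s * commit_interval_s / parallelism
    # Each writer produces one file per commit interval.
    files_per_day = parallelism * SECONDS_PER_DAY / commit_interval_s
    return file_size_mb, files_per_day


if __name__ == "__main__":
    # Hypothetical workload: 10 MB/s spread over 4 writers.
    for interval_s in (60, 600, 3600):
        size_mb, count = estimate_files(10, interval_s, parallelism=4)
        print(f"commit every {interval_s:>4}s: ~{size_mb:.0f} MB per file, ~{count:.0f} files/day")
```

Under these assumptions, raising the commit interval from one minute to one hour makes each file 60 times larger and cuts the daily file count by the same factor, at the price of data becoming visible up to an hour later.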

Re: Streaming data to parquet

2020-09-11 Thread Robert Metzger
Hi Marek, what you are describing is a known problem in Flink. There are some thoughts on how to address this in https://issues.apache.org/jira/browse/FLINK-11499 and https://issues.apache.org/jira/browse/FLINK-17505 Maybe some ideas there help you already for your current problem (use long checkp

Streaming data to parquet

2020-09-10 Thread Marek Maj
Hello Flink Community, When designing our data pipelines, we very often encounter the requirement to stream traffic (usually from Kafka) to an external distributed file system (usually HDFS or S3). This data is typically meant to be queried from Hive/Presto or similar tools. Preferably data sits in c