Hi Rafi,

At the moment I do not see any support for Parquet in the DataSet API except 
for the HadoopOutputFormat mentioned in the Stack Overflow question. I have 
cc’ed Fabian and Aljoscha; maybe they can provide more information.
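
In the meantime, one workaround that should be possible with the DataSet API 
is to register one HadoopOutputFormat-wrapped Parquet sink per distinct day, 
each with its own output path. Below is only a rough sketch, not a tested 
implementation: it assumes Avro GenericRecords written through parquet-avro's 
AvroParquetOutputFormat, a Hive-style "year=/month=/day=" layout, and that 
the number of distinct days is small enough to collect() to the client. The 
names basePath, dayPartition() and readRecords() are hypothetical 
placeholders, not anything shipped with Flink.

import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class PartitionedParquetWriteSketch {

  // Hypothetical schema: an event with a timestamp (epoch millis) and a value.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"ts\",\"type\":\"long\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}";

  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    final Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    final String basePath = "hdfs:///output/events"; // placeholder path

    // Stand-in for the existing hadoop-compatibility Parquet input.
    DataSet<GenericRecord> records = readRecords(env, schema);

    // Collect the distinct event-time days. Assumption: few enough
    // distinct days that bringing them to the client is cheap.
    List<String> days = records
        .map(new MapFunction<GenericRecord, String>() {
          @Override
          public String map(GenericRecord r) {
            return dayPartition((Long) r.get("ts"));
          }
        })
        .distinct()
        .collect();

    // One Parquet sink per day, each with its own output directory.
    // All sinks run in a single execute() call.
    for (final String day : days) {
      Job job = Job.getInstance();
      AvroParquetOutputFormat.setSchema(job, schema);
      FileOutputFormat.setOutputPath(job, new Path(basePath + "/" + day));

      HadoopOutputFormat<Void, GenericRecord> parquetFormat =
          new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);

      records
          .filter(new FilterFunction<GenericRecord>() {
            @Override
            public boolean filter(GenericRecord r) {
              return day.equals(dayPartition((Long) r.get("ts")));
            }
          })
          // HadoopOutputFormat expects Tuple2<K, V>; Parquet ignores the key.
          .map(new MapFunction<GenericRecord, Tuple2<Void, GenericRecord>>() {
            @Override
            public Tuple2<Void, GenericRecord> map(GenericRecord r) {
              return new Tuple2<>(null, r);
            }
          })
          .output(parquetFormat);
    }

    env.execute("partitioned parquet write");
  }

  // Derives a "year=YYYY/month=MM/day=DD" path fragment from epoch millis.
  private static String dayPartition(long epochMillis) {
    java.time.LocalDate d = java.time.Instant.ofEpochMilli(epochMillis)
        .atZone(java.time.ZoneOffset.UTC).toLocalDate();
    return String.format("year=%04d/month=%02d/day=%02d",
        d.getYear(), d.getMonthValue(), d.getDayOfMonth());
  }

  // Placeholder input; in the real job this would be the Parquet source
  // read via the hadoop-compatibility input format.
  private static DataSet<GenericRecord> readRecords(ExecutionEnvironment env, Schema schema) {
    GenericRecord r = new GenericData.Record(schema);
    r.put("ts", System.currentTimeMillis());
    r.put("value", "example");
    return env.fromElements(r);
  }
}

Note that this filters the full input once per day, so it only makes sense for 
a small number of partitions. For many partitions a custom FileOutputFormat 
that picks the path per record would be more efficient, but I am not aware of 
such a BucketingSink equivalent for the DataSet API.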

Best,
Andrey

> On 25 Oct 2018, at 13:08, Rafi Aroch <rafi.ar...@gmail.com> wrote:
> 
> Hi,
> 
> I'm writing a batch job which reads Parquet, does some aggregations, and 
> writes the results back as Parquet files.
> I would like the output to be partitioned by year, month, and day of the 
> event time, similar to the functionality of the BucketingSink.
> 
> I was able to achieve reading from and writing to Parquet by using the 
> hadoop-compatibility features.
> However, I couldn't find a way to partition the data by year, month, and 
> day to create the corresponding folder hierarchy; everything is written to 
> a single directory.
> 
> I found an unanswered question about this issue: 
> https://stackoverflow.com/questions/52204034/apache-flink-does-dataset-api-support-writing-output-to-individual-file-partit
> 
> Can anyone suggest a way to achieve this? Maybe there's a way to integrate 
> the BucketingSink with the DataSet API, or another solution?
> 
> Rafi
